Deep learning approach towards accurate state of charge estimation for lithium-ion batteries using self-supervised transformer model

Accurate state of charge (SOC) estimation of lithium-ion (Li-ion) batteries is crucial in prolonging cell lifespan and ensuring its safe operation for electric vehicle applications. In this article, we propose the deep learning-based transformer model trained with self-supervised learning (SSL) for end-to-end SOC estimation without the requirements of feature engineering or adaptive filtering. We demonstrate that with the SSL framework, the proposed deep learning transformer model achieves the lowest root-mean-square-error (RMSE) of 0.90% and a mean-absolute-error (MAE) of 0.44% at constant ambient temperature, and RMSE of 1.19% and a MAE of 0.7% at varying ambient temperature. With SSL, the proposed model can be trained with as few as 5 epochs using only 20% of the total training data and still achieves less than 1.9% RMSE on the test data. Finally, we also demonstrate that the learning weights during the SSL training can be transferred to a new Li-ion cell with different chemistry and still achieve on-par performance compared to the models trained from scratch on the new cell.

extensive domain knowledge, rigorous feature engineering, and relatively long development time 16 . Apart from that, model-based approach also does not scale well across battery cells different chemistry. As a result, alterations in cell chemistry requires a re-development of the model 17 . Additionally, model-based approach also does not account for anomalies in cells such as manufacturing inconsistencies, unpredictable operating conditions, cell degradation, and so forth 18 . Due to these shortcomings, more researchers are shifting their attention to using the data-driven approach for SOC estimation. In this approach, the SOC is directly modeled from observable signals such as voltage, current and temperature of the Li-ion battery cell sampled over diverse operating conditions across different cell chemistry and manufacturers 19 . There are various methods in data-driven SOC estimation such as artificial neural network 20 , support vector machine 21 , extreme learning machine 22 Gaussian process regression 23 , wavelet neural network 24 , nonlinear autoregressive with exogenous input neural network 25 , optimized neural network 26 and fuzzy logic 27 to name a few. One method that has been gaining traction lately is the use of a data-driven technique known as deep learning (DL) 28 . DL has great potential for SOC estimation due to its powerful capability to learn any function given the right data according to the universal approximation theorem 29 . In essence, DL can be used to directly approximate the relationship between the measurable cell signals (voltage, current, temperature) and the SOC with no additional processing such as using adaptive filters 30 . This eliminates the needs of manual feature engineering which can take a considerable amount of time and expert domain knowledge and still produce accurate SOC estimation results 31 . Pioneering works by authors 32,33 introduce long short-term memory (LSTM) and deep neural network DNN to directly estimate SOC from cell voltage, current and temperature with no additional filters. The proposed model achieved lowest MAE of 2.17% over a varying ambient temperature dataset. Authors in 34,35 proposed the utilization of gated recurrent unit (GRU) model to directly estimate SOC over a wide ambient temperature range with a RMSE of 3.5% under untrained temperature. There are also authors who proposed deep convolutional models such as in 36,37 with approximately less than 3% MAE on untrained data. Nonetheless, there are works on hybridizing convolutional and recurrent models such as 35,38 with under 2% RMSE on varying ambient temperature.
However, there are several research gaps with the existing DL methods for SOC estimation. These research gaps are the primary motivations of the proposal in this article. Firstly, all the cited works uses the supervised learning (SL) scheme to train the models which is known to require massive amount of data to accomplish 39 . Even in the scenarios where adequate data is available the training time for DL models typically takes many hours or days to complete 32 . Secondly, the models that are trained on one cell chemistry do not apply to other cell chemistry. Even though preliminary work indicates that transfer learning is possible 40 , further tests are still required to verify its accuracy if it applies to more cells with differing chemistry. In most cases a model that is trained on one Li-ion battery cell data does not generalize well across another cell and may require re-training of the model from scratch. Thirdly, most DL models use the recurrent DL architecture which may prove to work well with sequence data such as the SOC but are hard and slow to train 41 . Recurrent models also do not leverage on the parallel GPU computation that could significantly improve training time 25 . Lastly, even though recurrent architectures such as the LSTM or GRU can handle long sequences well, they are still susceptible to vanishing gradient for longer sequences 42 . Due to these limitations, recurrent models are generally superseded by another architecture known as the Transformer in many domains such as computer vision 43 and natural language processing 44,45 . In short, the primary motivations for the proposal in this article are (i) shortcomings of the SL training framework, (ii) Inadequate validation and testing of transfer learning capabilities in DL models, (iii) shortcomings of the DL recurrent architectures and (iv) Emergence and success of the Transformer DL model in other domains.
In this article, we introduce a new DL architecture for SOC estimation known as the Transformer. Apart from that, we propose a training framework that leverages on self-supervised learning (SSL) 39,46 and make it possible to train the Transformer on scarce amounts of Li-ion data in a short time and achieve higher estimation accuracy compared to models trained with conventional fully-supervised method. With the proposed framework, we demonstrate that the learned parameters from one cell can be transferred to another by fine-tuning the model on little data with very short training time (approximately 30 min on a GPU). This proposed framework also incorporates various recent DL techniques such as using Ranger optimizer with learning rate finder, time series data augmentation, and Log-Cosh loss function to boost the accuracy of the Transformer. Finally, we conclude the study by comparing and validating the performance of our proposed model to other recent DL architectures for SOC estimation. The key contributions of this study are as follows: • We introduce the transformer DL architecture for end-to-end SOC estimation with no feature engineering or adaptive filtering. • We propose the SSL training framework to train the proposed architecture in a very short amount of time and achieves improved estimation accuracy compared to conventional training framework. • The proposed model's parameters are transferable to a different cell type and requires only five training epochs to achieve RMSE ≤ 2.5%. • The SSL training framework enables the proposed model to be re-trained with as few as 20% of the total training data and still achieve lower error compared to the models that do not use the SSL framework. • We evaluate and validate the performance of the proposed model to other state-of-the-art DL architectures for SOC estimation.

Results
In the first two subsections, we highlight the estimation accuracy of the proposed model trained at room temperature and varying ambient temperatures and compare the estimation robustness against other DL models. In the subsequent sections, we study the influence of pre-training on the model and show that the pre-trained model www.nature.com/scientificreports/ outperforms the non-pre-trained models in estimation accuracy and convergence time. Next, we highlight that the learned weights of the pre-trained model can be transferred to perform estimation on a different cell with different chemistry with short re-training or fine-tuning. In all experiments, the proposed Transformer model was benchmarked against other widely used DL models for SOC estimation of various architecture types. The hyperparameter combinations of all comparison models were chosen to be as close as possible to the original publication. If that is infeasible, we adopted the hyperparameter configurations that is widely accepted or used and has proven to work well on many occasions. All common hyperparameters such as sliding window value, input output units, learning rate, batch size, training epochs were held constant. All comparison models were trained with the SL training framework and only the proposed Transformer model was trained with the SSL training framework under the same battery materials and EV drive cycle dataset.
Estimation accuracy under constant ambient temperature. In this section, we demonstrate the estimation accuracy and efficacy of the proposed model based on data sampled at room temperature. The error metrics of all models are tabulated in Table 1, sorted in ascending order of the test error. We found that the proposed model performance on SOC estimation accuracy of the RMSE in terms of training, validation and testing are 1.1087%, 0.8661%, 0.9056% and the MAE are 0.3289%, 0.4059%, and 0.4459%, respectively. It is observed that the proposed model outperforms all other models on the test dataset of RMSE and MAE tabulated in Table 1. For comparison, a baseline Transformer model that was trained using conventional training framework is included. Results indicate that the baseline model only scores abysmally compared to other models except the DNN, suggesting that the training framework plays a pivotal role in the robust performance of the proposed model. We observe that the recurrent models (GRU and LSTM) outperformed their convolutional and hybrid counterpart, which is not surprising as the recurrent models are specifically designed to handle sequence data well. Both the GRU and LSTM were configured to use single hidden layer with 100 neurons. Despite the compromise in estimation accuracy, convolutional models may be advantageous in training complexity compared to the recurrent models, as they are relatively easier to train and could better utilize GPU parallelization. The Resnet model is based on the Residual Network architecture 47 adapted to work sequential data 48 . Inception Time is based on the implementation in 49 which has shown superior benchmark performance in sequential data. ResCNN consist of convolutional neural network with Residual blocks as inputs. FCN consist of only convolution operation with no pooling operations has shown promising performance in 50 . GRU-FCN and LSTM-FCN are the hybrid models combining recurrent and convolutional models to obtain the advantages of both. The estimation plot on the test dataset consisting of US06, LS92 and UDDS drive cycle is shown in Fig. 1.

Estimation accuracy under varying ambient temperatures.
In this subsection, we compare the estimation accuracy of the proposed model to other widely used DL models of various architectures. Among all DL architectures compared in the study, the proposed transformer model achieved the lowest RMSE of 1.1075%, 1.3139% and 1.1914% and MAE of 0.4441%, 0.5680% and 0.6502% on the test drive cycles outperforming even the recurrent models which has been widely used for SOC estimation as shown in Table 2. We also note that the convolutional models such as the Resnet 40 and the Inception Time 51 also outperformed the conventional GRU 41 and LSTM 52 model. The baseline Transformer model that is not trained with the proposed training framework scores poorly along with the feedforward DNN. Figure 2 and Fig. 3 illustrate the SOC estimation plots of the proposed model across all test drive cycles at above and below zero • C ambient temperature, respectively. Traditionally, SOC estimation under low ambient temperature settings are extremely challenging due to the difference in the dynamics of chemical reactions in the cell 53 . However, observation in the estimated SOC by the proposed Transformer model shows promising results indicating its robustness in estimating SOC at extreme cold temperatures up to − 20 °C. www.nature.com/scientificreports/ Influence of pre-training. Size of training data. In this section, we investigate the influence of unsupervised pre-training on the amount of required data to train the proposed model with a low error rate. We divided the experiment into three scenarios. In the first scenario, we pre-trained and re-trained/fine-tuned the model with all (100%) available training data. The second and third scenarios were with 50% and 20% of the training data, respectively. In all scenarios, we noted the training time and error metrics. All training in this section was only performed for only 5 epochs to illustrate the short amount of training time required to achieve low error rates. There were three modes of training used in this setup namely the pre-training (PT), re-training (RT), fine-tuning (FT), and full training (T). In PT, the model was trained on unlabeled dataset with unsupervised learning. In RT, the mode was trained on a labeled dataset with supervised learning. In FT, the all the weights of the model were frozen except for the last layer and trained with supervised learning. In T, the model was trained from scratch with supervised learning. Table 3 shows the results obtained. We observe that when we pre-trained and re-trained the model on all available data (row 1), the error metric is lower compared to the model that was not pre-trained (row 3). Both took approximately the same duration of training time. We observe that in the event where we pre-trained and fine-tuned the model (row 2), even though the model only updates the weights of the final layer, it still scores a respectable 2.10% test RMSE with a reduction of about 10 min training time compared to the previous two modes. The effect of pre-training is even more pronounced in the second scenario where we only re-trained and fine-tuned the models on 20% of train data. At approximately the same amount of training time, the pre-trained model (row 7) scores lower on the non-pre-trained model (row 9). In this section we show that pre-training helps in reducing the test error with approximately same amount of training time especially when there is very little training data.
Transfer learning. In this section we investigate the role of unsupervised pre-training in helping the model to generalize its estimation capacity across different cell chemistry. We pre-trained the model on the LG 18650 LiNiMnCoO 2 cell and tested the model's estimation capacity on a Panasonic 18650 LiNiCoAlO 2 cell, similar to the cells used in some Tesla vehicles. Table 4 shows the performance of the proposed model with no changes in the model architecture.  www.nature.com/scientificreports/ We divided the experiment into four scenarios. In the first scenario, we pre-trained and re-trained/fine-tuned the model with all (100%) available training data. The second and third scenarios were with 50% and 20% of the training data, respectively. In the fourth scenario we pre-trained and re-trained the model on the same cell type with all available data shown in the last two rows of Table 4. Unsurprisingly, the best performing mode is when the model pre-trained and re-trained on the Panasonic cell and the worst performing model is when the model was pre-trained and re-trained on the LG cell. However, when the model was pre-trained on the LG cell and re-trained on the Panasonic cell, the test error rate is almost on par with the best performing model. This suggests that the pre-training helps in downstream re-training despite the difference in cell type. In the scenario when the model was trained on less data (20% of the training data), we observe that without pre-training (row 9) the model yields high test errors. In rows 7 and 8, we observe that the error rate is reduced by pre-training the model, even on a different cell. This once again is evidence that pre-training contributes to minimizing the test set error regardless of the cell type. Supplementary Fig. 2 illustrates the estimation of the worst performing mode. Despite being trained on a different cell type, the model still can capture the trend of the ground truth SOC value. With pre-training on the LG cell and re-training on the Panasonic cell, model can estimate the SOC more accurately as shown in Fig. 4.
In this section, we showed that the weights of the model that is learned during the unsupervised pre-training phase can be reused in re-training or fine-tuning across different cell types. This opens the possibilities of transfer learning which is extremely helpful especially when data and computational resource is scarce. On a side note, all re-training and fine-tuning in this and the previous section was only performed for 5 epochs to showcase the learning capability of the model despite a small training epoch. Re-training and fine-tuning the model for more epochs will likely yield better performance.

Discussion
In this work, a Transformer-based SOC estimation model in combination with the SSL framework was developed to address the challenges on Li-ion cell data availability, transfer learning, training speed and model accuracy. We show that the proposed model can achieve the lowest RMSE and MAE on the test set at various ambient   The third contribution of this work is to demonstrate that the weights from the encoder layers of the Transformer learned using the unsupervised pre-training phase can be readily transferred to another cell type of a  www.nature.com/scientificreports/ different chemistry. Additionally, with only five epochs of re-training, the model can achieve RMSE ≤ 2.5% on the test set. Extending the training time further likely leads to further reduction in RMSE. However, this work shows that even with lightweight re-training, the weights transferred from pre-training significantly contribute to the short training time with significantly less data. This opens the possibility of adapting the Transformer model to other types of Li-ion cell with only a fraction of the training data. The fourth contribution highlights the SSL training framework that enables the proposed model to be retrained with as few as 20% of the total training data and still achieve lower error compared to the models that do not use the SSL framework. Despite the reduced amount of training data, the model still generalizes well across different cell chemistry. This further accentuates the important role of unsupervised pretraining in allowing the model to be more data efficient.
The fifth contribution compares and validates the performance of the model against recent state-of-the-art DL models on SOC estimation. It is shown that the model clearly outperforms all other models in the RMSE and MAE metric on the test dataset. Given the efficacy of the model in SOC estimation accuracy, transfer learning capability, data-efficiency and training speed, the proposed Transformer model and framework is evidently advantageous over other DL models.

Methods
Dataset. In this study, we utilized raw data sampled from a brand-new cylindrical 18,650 LiNiMnCoO2 cell by LG which was made available by the McMaster University in Hamilton, Ontario, Canada 54 . The specification of the cell is given in Supplementary Table 1. The data was collected by subjecting the battery cell to various EV standard drive cycles such as UDDS, HWFET, LA92, US06 at varying ambient temperatures ranging from − 20 to 40 °C. In addition, to simulate the dynamics of driving conditions, the cell was also subjected to a random mix of the standard drive cycles. The division of train, validation and test dataset used in this study is specified in Supplementary Table 3. Figure 5a illustrates a sample plot of the UDDS drive cycle from the test dataset at − 20 °C ambient temperature. For DL models to work well, careful consideration is put in preprocessing the raw data samples. Firstly, the raw data is normalized into a range of 0 to 1 using Eq. (1). www.nature.com/scientificreports/ Next, the raw data was divided into three separate sets namely train, validation, and test set. The train set was used to train the model, validation set to check the generalization of the model during training and the test set was only used to evaluate the model at the end of training. The division of the train, valid and test set is tabulated in Supplementary Table 3.
Using supervised learning requires the dataset to be formatted into input-label pairs. In this study the input was the normalized values of voltage, current, and temperature while the label corresponds to the SOC value. The input-label pairs were constructed by running a sliding window across the time axis over the train, validation, and test set as illustrated Fig. 5. Voltage, current and temperature values that resides in the green window corresponds to the input and SOC value that resides in the purple window corresponds to the label. Note that the width of the window, k was kept at k = 400 timesteps.
Having massive amounts of data is a crucial component in training the proposed model without which the model will fail to generalize well 55 . Although the raw data already consists of hundreds of thousands of timesteps, it is still insufficient for the proposed model to work. Hence, the original dataset was augmented in various way by injecting random noise (µ = 0, σ = 0.33) onto the training and validation dataset. The types of noise injected includes additive and multiplicative Gaussian noise on the magnitude and random frequency noise generated with the wavelet decomposition method. Figure 5b shows a sample of the original and the augmented version of the plot.
Transformer model architecture. The original Transformer proposed in 56 uses the encoder-decoder arrangement in the architecture. The model proposed in this work only adopts the encoder portion and not the decoder as detailed in 57 to work better with multivariate time-series data. Figure 6 illustrates the block diagrams of the proposed model depending on the training stage which will be detailed more in the next section. Observe that in Stage 1 and Stage 2 there are several common blocks namely, input, x positional encoding, encoder stack, and linear layer. The input data to the model consists of the input vector of X = [V k , I k , T k ] representing the cell voltage, current and temperature at timestep, k. To give the input data contextual information, the input vector is then added to the positional encoding. There are various choices of positional encodings according to 58 . However, in this work we use the learned positional encoding with the sine and cosine functions as shown in Eqs. (2) and (3).  www.nature.com/scientificreports/ where pos and i correspond to the position and dimension, respectively. The reasoning behind using both functions has been previously detailed in 56 . Once tagged with the positional encodings, the input vector is passed through a series of encoder blocks. Core to the Transformer architecture is the multi-headed self-attention (MHSA) module inside the encoder block. The MHSA module applies self-attention to the input sequence with respect to the output sequence. As shown in the figure, the input to the multi-headed self-attention module are the key, K, value, V and query, Q. In the MHSA block, it attempts to map the query to a set key-value pairs with respect to an output to produce the attention matrix. The operation consists of a dot product of the query, Q with all keys and a division by d k and applying a softmax function over the result as given in Eq. (4) where dk is the dimension of the keys.
Instead of performing the operation in Eq. (4) once to produce a single matrix, the operation can be repeated multiple times in parallel and the resulting matrices can be concatenated into a larger matrix as show in Eq. (5).
In this work we use the number of parallel attention heads, h = 16. Each head uses d model = 128. Also used in the encoder module is the residual connection and residual dropout, which was set at dropout probability, p = 0.1.
The remaining model hyperparameter values are in Supplementary Table 4. Since Transformer models processes the entire sequence and does not account for the order of the input, a positional encoder is required to add contextual information. The positional encoder generates a unique encoding for each data point in the input vector and can generalize to longer sequences. Shown in Fig. 5c is the learned positional encoding generated for the input dataset in this study which consists of voltage, current and temperature of the Li-ion cell. The x-axis corresponds to the lag window in the input dataset which is used generate the dataset.
Training framework. Instantiating a DL model involves various stochastic processes. To ensure the reproducibility and consistency of the results obtained, all experiments were conducted using a preset seed value. Referring to Fig. 6, model training was divided into two distinct phases, namely the unsupervised pre-training and downstream fine-tuning. In the unsupervised pre-training stage 59 , unlabeled vectors of input sequence, X was used to train the model. Part of each input sequence values were randomly set to 0 by performing elementwise multiplication with a binary mask, M. The corrupted input, X˜ was generated with the X˜ = M 0 X. The model was then required to reconstruct the masked input with a modified MSE loss function, as given in Eq. (6).
where x is the predicted input vector values and x is the un-corrupted input vector values. Note that the loss does not require the model to reconstruct the entire input sequence but only elements in the mask, M. Upon completion of the unsupervised pre-training phase, the weights of the model save were transferred for the downstream fine-tuning phase. In this phase, the model was re-trained on a labeled dataset with supervised learning. The loss function, used in this phase is the hyperbolic cosine (Log-cosh) loss function as given in Eq. (7).
where y is the ground truth and yˆ is the predicted value by the model. The RMSE [Eq. (8)] and MAE [Eq. (9)] error metric was used to evaluate all models.
One of the most important hyperparameter used in training DL models is the learning rate, α (LR) 60 . To search for the optimal LR range of values, we employ the use of LR finder introduced in 61 . The optimal LR found with the LR finder is α = 1e3 as depicted in Supplementary Fig. 2. The LR value was used in conjunction with the Ranger optimizer which is a synergistic combination of Rectified Adam (RAdam) 62 and Lookahead optimizer 63 . RAdam has been shown to stabilize the training at the start and Ranger stabilizes the convergence in the remaining steps 64 . The Ranger optimizer is configured with momentum = 0.95, weight decay = 0.01 and epsilon of 1e −6 . This combination has been shown to achieve state-of-the-art results on many datasets 65,66 . As the training approaches the end, the LR is decayed to a lower value to further facilitate convergence to a global minimum 67 . The LR is decayed for each batch as follows,  Implementation. All models studied were trained on an Ubuntu 20.04.02 LTS Linux operating system with Intel Core i7-4790 K CPU at 4.00 GHz clock frequency, 32 GB of RAM and a Nvidia GeForce RTX3090 graphic processing unit. All DL models were built using the open source Pytorch 1.7.1 68 framework in tandem with the TSAI library 69 . The implementation of the proposed Transformer model and SSL training framework was divided into several steps. In Step 1, two distinct datasets from the LG LiNiMnCoO2 cell (Supplementary Table 1) and Panasonic LiNoCoAlO2 cell (Supplementary Table 2) were downloaded. Both dataset consist of data sampled from respective cells over a diverse range of temperature and drive cycles to simulate dynamic operating conditions as elaborated in Sect. 3.1. The dataset was divided into train, validation and test sets as shown in Supplementary Table 3. Next the data was normalized into the appropriate range (0 to 1) and pre-processed with sliding window of lag, k = 400 timesteps (Fig. 5b). The sliding window lag value, n is arbitrarily selected to due to the limits in our computing resources. Given more computational resources, k can be made larger to allow the model to consider more contextual information from the past. At this point the dataset was also augmented by injecting additive and multiplicative Gaussian noise. Finally, the dataset is transformed into the positional encoding form shown in Fig. 5c. This is the format of the data that is expected by the transformer model. In Step 2, the Transformer was configured using the appropriate model hyperparameters as detailed in Supplementary Table 4. Careful attention is placed on the dropout hyperparameter value as it largely influences the degree of overfitting on the dataset. We find that the settings of dropout in the feedforward layer to p = 0.2 and dropout in the residual layer to p = 0.1 work well in our experiments. In Step 3, the model is now ready to be trained. As illustrated, the model was trained in two distinct stages in the order of unsupervised pretraining and then downstream retraining. In "Estimation accuracy under constant ambient temperature", the model was trained on the LG dataset and was evaluated on its estimation accuracy at fixed and variable ambient temperature settings. In "Estimation accuracy under varying ambient temperatures. ", the model was trained on the LG dataset and tested on its estimation accuracy on the Panasonic dataset. In this step, the training hyperparameter was configured as detailed in Supplementary Table 5. Careful attention was placed on setting the LR in both training phase as it largely determines the performance and convergence to a global minimal. We rely extensively on the use of a LR finder algorithm which points us to setting the LR, α = 1 × 10 -3 for pretraining and α = 2 × 10 -4 for retraining.
In Step 4 the model was evaluated on the SOC estimation accuracy in "Estimation accuracy under constant ambient temperature" and the influence of pretraining and SSL on the performance of the proposed model in "Estimation accuracy under varying ambient temperatures.". The performance of the model was quantified with the RMSE and MAE performance metrics. Finally, the performance of the model is compared to various widely used DL models on similar performance metrics.

Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Code availability
The software code and the examined cases that validated our method are available from the corresponding author upon reasonable request.