Deep learning based CETSA feature prediction across multiple cell lines with latent space representation

Mass spectrometry-coupled cellular thermal shift assay (MS-CETSA), a biophysical technique that measures the thermal stability of proteins at the proteome level inside the cell, has contributed significantly to the understanding of drug mechanisms of action and the dissection of protein interaction dynamics in different cellular states. One of the barriers to the wide application of MS-CETSA is that experiments must be performed on the specific cell lines of interest, which is typically time-consuming and costly in terms of labeling reagents and mass spectrometry time. In this study, we aim to predict CETSA features in various cell lines by introducing a computational framework called CycleDNN based on deep neural network technology. For a given set of n cell lines, CycleDNN comprises n auto-encoders. Each auto-encoder includes an encoder to convert CETSA features from one cell line into latent features in a latent space Z. It also features a decoder that transforms the latent features back into CETSA features for another cell line. In such a way, the proposed CycleDNN creates a cyclic prediction of CETSA features across different cell lines. The prediction loss, cycle-consistency loss, and latent space regularization loss are used to guide the model training. Experimental results on a public CETSA dataset demonstrate the effectiveness of our proposed approach. Furthermore, we confirm the validity of the predicted MS-CETSA data from our proposed CycleDNN through validation in protein–protein interaction prediction.

thermal denaturation, as evidenced by similar melting curves over the temperature range 13. However, the extent to which CETSA features correlate with PPIs has not been systematically investigated.
Despite its utility, there are still significant barriers to the large-scale application of MS-CETSA. A key challenge is the reliance on time-consuming and resource-intensive biological experiments to obtain protein melting curves for each cell line of interest. Generating complete MS-CETSA datasets across multiple cell lines is almost infeasible, given the current depth of mass spectrometry measurement. While some proteins are common across cell lines, others may only be present in specific contexts. To overcome this bottleneck, we develop a computational approach for predicting CETSA features across cell lines based on limited experimental data, as shown in Fig. 1. By extrapolating from one cell line to others, this predictive modeling aims to dramatically reduce the experimental burden and enable broader applications of the MS-CETSA methodology.
In the field of machine learning, deep neural networks have driven revolutionary advances in various areas 14, especially in computer vision. An important topic in computer vision is image-to-image translation, where the style of one image is transferred to another. Two influential models are pix2pix 15 and CycleGAN 16, which can perform robust image translation across different domains while preserving key texture and content attributes. These models inspire the development of similar techniques for transferring features across different domains. Given that our proposed approach is primarily inspired by image-to-image translation techniques, other related technologies for cross-modality translation, e.g., non-image to image translation 17, will not be discussed in this work.
Inspired by image translation techniques, we develop a novel deep learning framework called CycleDNN to predict CETSA features across cell lines. CycleDNN contains encoders {E_1, E_2, ..., E_n} and decoders {D_1, D_2, ..., D_n} corresponding to cell lines {C_1, C_2, ..., C_n}. Each encoder E_i translates the CETSA features of cell line C_i into a latent space Z, and each decoder D_i translates Z back into the CETSA features of cell line C_i (i ∈ {1, 2, ..., n}). Any encoder and decoder can be paired to form an auto-encoder for predicting CETSA features from one cell line to another, under the guidance of the prediction loss, the cycle-consistency loss, and the latent space regularization loss. Thus, our approach enables the reciprocal prediction of features across different cell lines. While our method is inspired by pix2pix and CycleGAN, it differs in that all our auto-encoders are constructed using deep neural networks (DNNs) as opposed to generative adversarial networks (GANs) 18. In addition, these auto-encoders have identical network architectures but operate with different parameters.
We have previously reported that MS-CETSA data can be used to predict PPI scores using a decision tree, a classic machine learning model 19. In this study, we further explore PPI prediction from CETSA data and treat it as an evaluation task to verify the efficacy of the proposed CycleDNN for CETSA feature prediction, i.e., whether the translated CETSA features can also be adapted for PPI prediction. We use the CETSA data predicted by our proposed CycleDNN to predict PPI scores and compare the performance with that of the experimental CETSA data taken from Tan et al. 13. The PPI score prediction results further verify the effectiveness of our proposed method.
The preliminary results of this study have been reported in 20. Significant changes have been made compared to our previous work. Firstly, we develop a novel training and testing framework that is computationally efficient and flexible for CETSA feature prediction across multiple cell lines. Secondly, we perform extensive experiments on multiple cell lines to verify the effectiveness of the proposed framework. Lastly, we include PPI prediction as an evaluation task and perform additional experiments to further verify the feasibility of the proposed framework. The main contributions of our research work are summarized as follows:
• We introduce a computational framework that utilizes a novel deep neural network model, CycleDNN, to convert CETSA features among various cell lines. With only the CETSA features of a specific protein in a single cell line, our approach can accurately predict its CETSA features in other cell lines.
• By introducing the latent space Z, we adopt n encoders and n decoders corresponding to n cell lines to achieve the prediction of CETSA features across multiple cell lines. This reduces the complexity of the model from quadratic to linear compared with individual one-to-one prediction models.
• We perform extensive experiments on the public CETSA feature dataset and verify the effectiveness of our method. We further perform experiments using PPI prediction with CETSA features predicted by CycleDNN and achieve performance similar to that of experimental CETSA features.
Figure 1. The diagram of the prediction of CETSA features across cell lines.
• We publish the source code 21 of the implementation of the proposed method. Interested readers can use the source code for their own biological feature predictions. We hope that this effort can motivate further exploration of deep neural networks for biological feature (e.g., CETSA) prediction.

Background
To the best of our knowledge, we are the first to study and realize CETSA data prediction across cell lines.
In this section, we focus on related work in the fields of computer vision and auto-encoders, both of which largely inspire our work in this study. Moreover, PPI prediction is also closely related to our study.

Image style transfer
Fueled by progress in deep learning 22 and generative adversarial networks 18, significant strides have been made in the field of image style translation. A pioneering model in this domain is the neural algorithm introduced by Justin Johnson et al. in 2015 23, which leverages VGG-19 24 and posits that deep convolutions can distill content information, while shallow convolutions can extract style information. Pix2pix 15 and CycleGAN 16 stand as prominent techniques for image-to-image translation, catering to paired and unpaired data, respectively. Pix2pix refines the GAN architecture by integrating a conditional GAN model for a wide range of paired image translations. CycleGAN, conversely, is an advancement of the GAN architecture that caters to unpaired data, involving the parallel training of dual generator and discriminator models to create a cyclic route. Within this context of mutual translation of paired attributes, we borrow the concept of "consistency" from CycleGAN: an input translated by the first generator and then passed through the second generator should match the original input, and vice versa. Similarly, our CycleDNN method constructs a "cycle" to ensure that when the features of a protein in cell line A are processed through Encoder E_A, Decoder D_B, Encoder E_B, and Decoder D_A, the output corresponds to the input protein features in cell line A.

Auto-encoders
The idea of the latent space Z is inspired by auto-encoders. Auto-encoders are a type of algorithm that learns a hidden "informative" representation of the data, first proposed by Rumelhart et al. 25. With the nonlinear feature extraction ability of deep neural networks, auto-encoders can obtain a good data representation, and their performance is better than that of linear methods such as principal component analysis (PCA) 26. In this study, the hidden "informative" representation can be considered the common latent space Z, i.e., a latent representation of the same protein that does not change when the cell line changes. However, the difference between auto-encoders and our method is that our goal is mutual prediction rather than reconstruction.
In contrast to the CNNs, GANs, and other models commonly used in image transfer, the main body of our model adopts the structure of the deep neural network (DNN), also known as the multilayer perceptron (MLP) or artificial neural network (ANN). It is the most classical deep learning model, developed on the basis of the single-layer perceptron. It is also the most common model for auto-encoders and the most suitable model for CETSA data.

PPI prediction
Research on PPIs is mainly divided into two categories. The first uses experimental methods, such as yeast two-hybrid screening 27, nucleic acid programmable protein arrays (NAPPA) 28, affinity purification–mass spectrometry (AP-MS) 29, correlated mRNA expression profiles 30, synthetic lethal analysis 31, and so on. However, these methods are normally time-consuming and expensive. Moreover, experimental results often show notable inter- or intra-experiment variance. This leads to the second type of method, which uses computational models and other properties of proteins to predict PPIs.
Research on PPI prediction through computational models has developed particularly rapidly in recent years, mainly due to advances in machine learning and deep learning. Various new methods from deep learning, machine learning, and statistics have been combined with diverse protein data to produce new prediction methods for PPIs, such as network-based models 32, sequence-based models 33, structure-based models 34, genomic-based models 35, and so on. But so far, no one except our group has tried to use CETSA data to predict PPIs 19.

CETSA data
The CETSA data used in this study are from Tan et al. in 2018 13 and consist of multiple cell lines. Each cell line contains more than seven thousand proteins, and each protein has 10 features from 10 temperatures. For a pair of cell lines, certain common proteins have CETSA features in both cell lines, while other proteins have CETSA features in only one cell line. We train the CycleDNN model on the common proteins; the trained model can then predict CETSA features from one cell line to another for those proteins that have CETSA features in only one cell line.

CycleDNN for two cell lines
Deep neural networks 22 are extensively employed in both classification and generative models across various fields such as computer vision and natural language processing. These networks exhibit greater expressiveness and feature extraction capabilities compared to perceptrons. The fundamental architecture typically encompasses an input layer, multiple hidden layers, and an output layer. Each node within the network primarily executes a blend of a linear operation and a nonlinear activation function. Our initial step involves the construction of a model that enables pairwise data prediction across two distinct cell lines. In this scenario, CycleDNN is primarily composed of two encoders and two decoders. For a protein that concurrently exists in two cell lines (for instance, HCT116 and HEK293T, designated as cell lines A and B), we utilize two encoders, E_A and E_B, to transform the 10-dimensional CETSA features into a shared latent space Z, which consists of 5000-dimensional latent features. Decoders D_A and D_B are then employed to revert Z back to the CETSA data of cell lines A and B, respectively. During the training phase, we simultaneously train both sets of encoders and decoders. Figure 2 provides a comprehensive diagram of CycleDNN for the prediction of CETSA features between two cell lines.
Both the encoder and the decoder are constructed using DNNs. The two encoders, E_A and E_B, as depicted in Fig. 2, possess identical network structures. However, it is crucial to note that they do not share parameters. Similarly, the two decoders, denoted as D_A and D_B, also share the same network structure but not the same parameters. The encoder's network structure primarily consists of three fully connected layers, each incorporating a linear operation and a rectified linear unit (ReLU) activation function. Additionally, a dropout layer is included to mitigate overfitting. The decoder's structure mirrors that of the encoder but is assembled in the reverse order. Figure 3 illustrates the integrated encoder and decoder architecture in CycleDNN transitioning from cell line A to cell line B. Detailed network parameters are shown in Table 1.
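The encoder/decoder structure described above can be sketched as a plain three-layer MLP. This is a minimal numpy illustration, not the authors' implementation: the intermediate layer widths (256 and 1024) are our assumptions, since the actual values live in the paper's Table 1, and dropout is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # He-style initialization suited to ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / in_dim), (in_dim, out_dim)), np.zeros(out_dim)

def relu(x):
    return np.maximum(x, 0.0)

class MLP:
    """Three fully connected layers; ReLU after all but the last layer."""
    def __init__(self, dims):
        self.layers = [linear(a, b) for a, b in zip(dims[:-1], dims[1:])]

    def __call__(self, x):
        for i, (W, b) in enumerate(self.layers):
            x = x @ W + b
            if i < len(self.layers) - 1:
                x = relu(x)
        return x

# Hypothetical hidden widths; only the 10-d input and 5000-d latent are from the text.
ENC_DIMS = [10, 256, 1024, 5000]      # 10 temperature features -> 5000-d latent
DEC_DIMS = list(reversed(ENC_DIMS))   # decoder mirrors the encoder, reversed order

E_A, D_B = MLP(ENC_DIMS), MLP(DEC_DIMS)
cetsa_a = rng.random((4, 10))         # a batch of 4 proteins from cell line A
z = E_A(cetsa_a)                      # shared latent representation
cetsa_b_pred = D_B(z)                 # predicted CETSA features in cell line B
```

Pairing E_A with D_B in this way realizes one direction of the cross-cell-line prediction; pairing E_B with D_A gives the reverse direction.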

CycleDNN for multiple cell lines
In this subsection, we generalize the feature prediction by CycleDNN from two cell lines to multiple cell lines. The key to the generalization is the common latent space Z. If we directly use deep neural networks to achieve prediction between any two cell lines, we need to train n(n − 1) networks for cell lines {C_1, C_2, ..., C_n}. Assuming that each deep neural network can be decomposed into an encoder and a decoder, a total of n(n − 1) encoders and n(n − 1) decoders are required, as shown in Fig. 4. Such a model has a significant disadvantage: as the number of cell lines grows, the number of networks that need to be trained grows quadratically, which is highly complex and expensive. To overcome this disadvantage, we introduce the common latent space Z and redesign the structure of CycleDNN for the CETSA feature prediction of various cell lines.
The common latent space Z represents latent representations of the same protein that do not change when the cell line changes. So the CETSA data of any cell line can be mapped to the common latent space Z by the encoder of the corresponding cell line. Moreover, any decoder can decode Z to the CETSA features of the corresponding cell line. Based on the common latent space Z, we design a new CycleDNN structure, as shown in Fig. 5. In the new CycleDNN structure, we only need n encoders and n decoders for cell lines {C_1, C_2, ..., C_n}. This new structure reduces the complexity of the model from quadratic to linear, which greatly reduces the parameters and training cost of the model. Our method can therefore be generalized to any number of cell lines efficiently and flexibly. We constrain the common latent space Z by designing loss functions. In the experiments, we perform the prediction of CETSA features across five human cell lines: A375, HCT116, HEK293T, HL60, and MCF7.
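The pairing logic behind the shared latent space can be made concrete with toy stand-ins. In this sketch each "encoder" and "decoder" is just an invertible scaling (our simplification, not a real network), which is enough to show that 2n trained components cover all n(n − 1) directed cell-line pairs and that the cycle composition returns the original features.

```python
def make_codec(scale):
    """Toy encoder/decoder pair for one cell line: C_i -> Z and Z -> C_i."""
    enc = lambda x: [scale * v for v in x]
    dec = lambda z: [v / scale for v in z]
    return enc, dec

cell_lines = ["A375", "HCT116", "HEK293T", "HL60", "MCF7"]
encoders, decoders = {}, {}
for k, name in enumerate(cell_lines, start=1):
    encoders[name], decoders[name] = make_codec(float(k))

def predict(src, dst, features):
    """F_{i,j} = D_j o E_i: translate CETSA features from cell line src to dst."""
    return decoders[dst](encoders[src](features))

n = len(cell_lines)
print(2 * n, n * (n - 1))   # prints 10 20: 10 shared components vs 20 pairwise models
```

Because every encoder maps into the same Z, any of the n² directed translations is available without training a dedicated network per pair, and composing `predict(A, B, ·)` with `predict(B, A, ·)` recovers the input, mirroring the cycle-consistency idea.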

Loss functions
As depicted in Fig. 5, we can discern the mapping relationships between different cell lines. Through the latent space Z, the composition of encoder E_i and decoder D_j defines a mapping function F_i,j : C_i → C_j, i.e., F_i,j = D_j ∘ E_i.
In terms of loss function design, we consider three varieties. The first is the mean squared error (MSE) loss, which measures the discrepancy between the predicted data and the ground truth, referred to as the prediction loss. This category encompasses the loss incurred when predicting cell line B from cell line A and vice versa:

L_1 = Σ_{i≠j} MSE(F_i,j(C_i), C_j). (1)
The second category is the cycle-consistency loss, a concept inspired by CycleGAN. Given the mapping functions F_i,j : C_i → C_j, it is logical to assume that if we generate cell line C_j from cell line C_i using F_i,j, we should be able to reconstruct cell line C_i from the generated data using F_j,i. In other words, F_j,i(F_i,j(C_i)) ≈ C_i. Hence, the second component of the loss function can be articulated as:

L_2 = Σ_{i≠j} MSE(F_j,i(F_i,j(C_i)), C_i). (2)

The third category of loss function ensures consistency within the latent space Z, referred to as the latent space regularization loss. This loss guarantees that the latent representations of the same protein from different cell lines are approximately similar, which is critical for the successful prediction of CETSA features across cell lines. These representations in the high-dimensional space should capture essential features of the protein that do not change when the protein is present in different cell lines. Let Z_i be the latent representation of a protein obtained by the encoder E_i of cell line C_i, and Z_k the latent representation of the same protein from a different cell line C_k, where n is the number of cell lines adopted. To maintain consistency between the outputs of encoders E_i and E_k, a mean squared loss between Z_i and Z_k is imposed:

L_3 = Σ_{i<k} MSE(Z_i, Z_k). (3)

Ultimately, the three types of losses are combined with distinct coefficients α_1, α_2, and α_3, which are optimized empirically. The total loss function of the proposed method is:

L = α_1 L_1 + α_2 L_2 + α_3 L_3. (4)
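The three loss terms described above can be sketched as a single batch-level computation. This is a minimal numpy sketch following the textual definitions, not the authors' training code: `batches[i]` is assumed to hold the CETSA features of the same common proteins in cell line i, and `encoders`/`decoders` are any callables.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def cycle_dnn_loss(batches, encoders, decoders, a1=1.0, a2=0.01, a3=1.0):
    """Total CycleDNN loss over one batch of common proteins.

    a1, a2, a3 correspond to the coefficients of the prediction,
    cycle-consistency, and latent regularization terms (default values
    taken from the implementation details in the text).
    """
    n = len(batches)
    latents = [encoders[i](batches[i]) for i in range(n)]
    pred = cyc = lat = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            c_j_hat = decoders[j](latents[i])              # F_{i,j}(C_i)
            pred += mse(c_j_hat, batches[j])               # prediction loss term
            z_back = encoders[j](c_j_hat)                  # re-encode prediction
            cyc += mse(decoders[i](z_back), batches[i])    # cycle back to C_i
        for k in range(i + 1, n):
            lat += mse(latents[i], latents[k])             # latent consistency
    return a1 * pred + a2 * cyc + a3 * lat
```

As a sanity check, identity encoders/decoders applied to identical batches make every term vanish, so the total loss is exactly zero.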

Datasets description
The dataset we utilize originates from the experimental data of Tan et al. in 2018 13. We adopt the protein melting data from the A375, HCT116, HEK293T, HL60, and MCF7 intact-cell CETSA experiments as the CETSA features of the different cell lines for mutual prediction. These proteins' CETSA melting curves were downloaded from Tan et al. The data file "tabless1_to_s27.zip" includes 27 CETSA data tables. Among these, we selected five tables, S19, S20, S21, S22, and S23, which correspond to the five intact-cell CETSA experiments of A375, HCT116, HEK293T, HL60, and MCF7. We select the column attributes T37, T40, T43, T46, T49, T52, T55, T58, T61, and T64. These ten columns, the specific values of the protein melting curve, serve as the input to our model.
The CETSA features for these five cell lines encompass relative abundance data for 8101, 7599, 7945, 7448, and 7790 proteins at ten distinct temperatures (37 to 64 °C), which have been standardized. As common protein data are necessary for latent-space prediction, we use the intersection of A375, HCT116, HEK293T, HL60, and MCF7, which includes common data for a total of 4860 proteins, as the benchmark to evaluate the proposed method across multiple cell lines. These CETSA data are fed into the neural network. Finally, the datasets are randomly split into training and test sets at a ratio of 70%/30%.
The PPI dataset BioPlex 3.0 36 is adopted as the ground truth in evaluation. PPI scores between two proteins in BioPlex 3.0 are normalized values in [0, 1]; a bigger score indicates a higher probability of interaction between the two proteins. There are 25,485 protein pairs for cell line HCT116 and 41,490 protein pairs for cell line HEK293T in the CETSA dataset. Since the PPI data of HEK293T in BioPlex are relatively more comprehensive, we only use the PPI data of HEK293T for training and testing.

Metrics
To validate the efficacy of our model, we employ the following evaluation metrics: mean squared error (MSE), mean absolute percentage error (MAPE), mean absolute error (MAE), R-squared (R^2), and the Pearson correlation coefficient (PCC), which are frequently used to evaluate regression models. Assuming the model's input is X, the predicted value is y′, the actual value is y, and the mean values of y and y′ are ȳ and ȳ′, these metrics take their standard definitions, computed between the predicted and ground-truth CETSA feature vectors.
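The five metrics can be computed directly from their standard definitions. A compact numpy sketch (our formulation of the textbook definitions; MAPE assumes nonzero ground-truth values):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, MAPE (%), MAE, R-squared, and Pearson correlation coefficient."""
    y, yp = np.asarray(y_true, float), np.asarray(y_pred, float)
    mse = np.mean((y - yp) ** 2)
    mae = np.mean(np.abs(y - yp))
    mape = 100.0 * np.mean(np.abs((y - yp) / y))       # requires y != 0
    ss_res = np.sum((y - yp) ** 2)                     # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)               # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    pcc = np.corrcoef(y, yp)[0, 1]                     # Pearson correlation
    return dict(MSE=mse, MAPE=mape, MAE=mae, R2=r2, PCC=pcc)
```

For a perfect prediction, MSE, MAPE, and MAE are 0 while R^2 and PCC are 1, which gives a quick sanity check of any implementation.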

Implementation details
In this study, we adopt the PyTorch library to implement our model and conduct CETSA feature prediction experiments on the NVIDIA platform with GeForce GTX TITAN X GPUs. We train the model on the training set and evaluate the performance on the test set. During training, the CETSA data are randomly shuffled. Each model is trained for a total of 4000 epochs on 1 GPU with a total batch size of 128. All models are trained from scratch and optimized using stochastic gradient descent with momentum 0.95 and weight decay 1e-5. The base learning rate is 0.01, declined by 5% every 500 iterations. The dropout rate is set to 0.3 to improve robustness. The hyper-parameters α_1, α_2, and α_3 are 1, 0.01, and 1, respectively.
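The learning-rate schedule described above can be written as a small helper, assuming the 5% decline is multiplicative (the text does not state this explicitly):

```python
def learning_rate(step, base_lr=0.01, decay=0.95, interval=500):
    """Step-wise schedule: the rate declines by 5% every 500 iterations."""
    return base_lr * decay ** (step // interval)
```

For example, the rate stays at 0.01 through iteration 499 and drops to 0.0095 at iteration 500; in PyTorch this corresponds to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.95)`.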

Training details
At each epoch, the CETSA features in each cell line are translated to all cell lines. In addition, the CETSA features in each cell line are reconstructed after prediction to keep cycle-consistency. For example, when we only consider the case of three cell lines, at each epoch, the features in cell line A are translated to cell lines B and C through the corresponding encoder and decoders. After prediction, the predicted CETSA features are reconstructed from cell lines B and C back to the CETSA features of cell line A. The same training procedure is repeated for cell lines B and C. In the training process across five cell lines, our strategy involves sequentially training the model to predict CETSA features from cell line A to cell lines A...E, followed by cell line B to cell lines A...E, and so forth, until cell line E. The training stops if there are no observed improvements in performance for 300 consecutive epochs. Additionally, the maximum number of training epochs is capped at 5000.
In line with the implementation details, our model reached convergence after 4000 epochs of training, culminating in optimal performance. This process was completed in a span of 4.5 h in our test environment. During the training of two distinct cell lines, optimal results were achieved with just 1.6 h of training. This time consumption aligns closely with the scale of our model.

PPI prediction evaluation
Machine learning methods are widely used in various fields of bioinformatics. We adopt a decision tree as our model to predict PPI scores between protein pairs based on the proteins' CETSA features. The 5-fold cross-validation approach is applied to the decision tree model, where the ratio of the training set to the test set is 4:1. For comparison, we use the experimental CETSA data and the CETSA data predicted by CycleDNN to train the PPI prediction models, respectively. If the two prediction models achieve similar performance, it indicates the validity of the predicted CETSA features from our proposed CycleDNN.
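The evaluation pipeline above can be sketched in a few lines, assuming scikit-learn is available. This is an illustrative stand-in, not the authors' code: the synthetic data, tree depth, and the choice to concatenate the two proteins' 10-dimensional CETSA feature vectors into one 20-dimensional input are all our assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_pairs = 200
# Hypothetical pair features: two proteins' 10 CETSA features, concatenated.
X = rng.random((n_pairs, 20))
y = rng.random(n_pairs)              # placeholder PPI scores in [0, 1]

model = DecisionTreeRegressor(max_depth=5, random_state=0)
# 5-fold cross-validation, i.e., a 4:1 train/test ratio per fold.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
mae = -scores.mean()                 # cross-validated MAE, the metric reported later
```

Running the same pipeline once with experimental CETSA features and once with CycleDNN-predicted features, then comparing the two MAE values, reproduces the comparison described in the text.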

Two cell lines (A375 and HCT116)
The main results of this study with two cell lines are listed in Table 2. According to the results, the MSE, MAPE, MAE, R^2, and PCC of our model reach 0.01232, 13.971%, 4.61 × 10^-4, 0.89773, and 0.94750 in the prediction from the A375 cell line to the HCT116 cell line. It also works in the prediction from the HCT116 cell line to the A375 cell line, reaching 0.01805, 15.950%, 5.68 × 10^-4, 0.88773, and 0.94221 in MSE, MAPE, MAE, R^2, and PCC. Predictions in both directions are thus accurate; this phenomenon also exists in subsequent experiments. Since we are the first to realize automatic CETSA feature prediction across cell lines, there are no existing methods to compare against.

Multiple cell lines
The main results of this study with three cell lines and five cell lines are listed in Tables 3 and 4, respectively. The experimental data show that our model can be effectively applied no matter which two cell lines are used for CETSA feature prediction. The experimental results in Tables 3 and 4 with three and five cell lines are also similar to those with two cell lines, which further verifies the validity of our method's design for multiple cell lines.
As can be seen from Tables 3 and 4, in the prediction across different cell lines, most CETSA feature prediction results achieve an MSE below 0.002, a MAPE below 20%, an MAE below 0.001, an R^2 above 0.75, and a PCC above 0.88. This illustrates the overall effectiveness of our method. Meanwhile, the quality of the prediction results differs among cell lines. The best performance of the CETSA feature prediction with three and five cell lines is from cell line A (A375) to cell line C (HEK293T). In Table 4, it reaches 0.00632, 11.589%, 3.27 × 10^-4, 0.95017, and 0.97482 in MSE, MAPE, MAE, R^2, and PCC. Moreover, its MSE, MAPE, MAE, R^2, and PCC also reach 0.00635, 12.197%, 3.22 × 10^-4, 0.95015, and 0.97281 in Table 3.
In addition, adding more cell lines to our method can partially improve the performance of our model. From the comparison of Tables 3 and 4, we can see that the accuracy of some predictions is improved. For example, the prediction precision from cell line A (A375) to cell line C (HEK293T) in the five-cell-line model is better than that in the three-cell-line model: its MSE, MAPE, and PCC improve from 0.00635, 12.197%, and 0.97281 to 0.00632, 11.589%, and 0.97482. This indicates that adding more cell lines may further improve the accuracy of the information extracted from the latent space Z.

Ablation study
In this section, we explore the contribution of the different loss functions in the proposed method by conducting ablation experiments on cell lines A375, HCT116, and HEK293T. As mentioned in the methodology section, we propose CycleDNN with the prediction loss L_1, cycle-consistency loss L_2, and latent space regularization loss L_3. We explore all these variants quantitatively.
Table 5 shows the MSE, MAPE, and MAE results of different variants of the proposed network. Comparing all variants with our complete proposed model, it can be seen that all of the loss functions contribute to the performance, with L_1 playing the most important role. By employing all three loss functions, CycleDNN amalgamates the benefits of each, and the optimal performance is achieved through coefficient optimization. These experimental comparisons underscore the effectiveness of each of the three loss functions in our proposed method, thereby validating its design. Notably, CycleDNN with the latent space Z proves to be valuable: from a biological standpoint, the same protein, though encoded from different cell lines via the corresponding encoders, should possess common features in the latent space Z.

Protein–protein interaction prediction
In this part, we use the CETSA features of the 4860 proteins of HEK293T predicted from cell line A375 through the trained CycleDNN as the input to the decision tree model. Our predicted input corresponds to 21,536 protein interaction pairs in BioPlex 3.0 37. In the PPI prediction results using a decision tree, our predicted CETSA data for cell line HEK293T obtain an MAE of 0.072198, which is very close to the MAE of 0.070726 obtained from the experimental CETSA data. Moreover, as shown in Fig. 6, the shapes of the histograms for prediction and ground truth are quite similar, which indicates that the predicted PPI scores match the ground-truth PPI scores very well. This further verifies the effectiveness of our prediction model CycleDNN in applications of CETSA data.

Advantages and limitations
In our method, an encoder contains 2.761M parameters, with a computational cost of 5.510M FLOPs. A decoder contains 2.755M parameters, with a computational cost of 2.756M FLOPs.
The emergence of CycleDNN greatly reduces the workload of CETSA biochemical experiments. For a typical task, we need to know the CETSA values of a certain protein in n cell lines. If we rely solely on experiments, we have to repeat the CETSA biochemical experiment n times in n cell lines, which is undoubtedly extremely expensive and time-consuming. With the help of CycleDNN, we only need to perform one CETSA biochemical experiment in one cell line (e.g., HEK293T); the CETSA data for this protein in the other cell lines are then predicted by CycleDNN instead of relying on experiments.
CycleDNN has great advantages over traditional pairwise DNN models (as shown in Fig. 4) for the prediction of CETSA data. First of all, due to the introduction of the common latent space Z, we significantly reduce the number of parameters of the neural network. While maintaining the same size of the networks across n cell lines, our model reduces the number of encoders and decoders from n(n − 1) to n. In the experiments with five cell lines, CycleDNN reduces the number of model parameters by 75%. Moreover, our method also has a great advantage in prediction speed. In a typical task, we know the CETSA data for a new protein in one cell line and wish to predict the CETSA data in the other cell lines. Our method greatly reduces the number of encoders required in prediction, thereby increasing the prediction speed. In the experiments with five cell lines, CycleDNN reduces the number of encoders by 80%, which reduces the prediction time by 53.4%.
On the other hand, the proposed approach has two primary limitations. Firstly, the model necessitates initial training on the CETSA features of specific proteins found in both cell lines; it then predicts the CETSA features of the remaining proteins from one cell line to another. This capability is limited to CETSA feature translation across the cell lines used for training; the model cannot handle CETSA feature translation for cell lines that were not part of the training set. Secondly, while the proposed CycleDNN serves as an automated computational framework for predicting CETSA features across cell lines, and the predicted values closely align with the original experimental features as validated in this study, a thoughtfully designed biological evaluation is recommended to further confirm the biological significance of the predicted CETSA features.
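The 75% reduction claimed for five cell lines follows directly from the component counts. A short arithmetic check (assuming, as the text does, that all encoders and decoders are the same size, so parameter count is proportional to component count):

```python
def model_counts(n):
    """Components needed to cover all n(n-1) directed cell-line pairs."""
    pairwise = 2 * n * (n - 1)   # n(n-1) encoders + n(n-1) decoders (Fig. 4 design)
    shared = 2 * n               # n encoders + n decoders via the latent space Z
    return pairwise, shared

pairwise, shared = model_counts(5)
reduction = 1 - shared / pairwise
print(pairwise, shared, reduction)   # prints 40 10 0.75
```

So for five cell lines the shared-latent design needs 10 components instead of 40, a 75% reduction, matching the figure reported above; the gap widens quadratically as n grows.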

Conclusions
In this study, we focus on the transfer of CETSA data for the same protein across different cell lines, for which we propose a novel DNN model. The results of our proposed method, as applied to the protein melting data from intact-cell MS-CETSA experiments, are presented in Tables 2, 3 and 4. The results demonstrate that it performs well in prediction across the cell lines A375, HCT116, HEK293T, HL60, and MCF7. The ablation study in Table 5 verifies the effectiveness of each of the three loss functions in our proposed model. At the same time, the neural architecture we design greatly reduces the complexity of the model from quadratic to linear when converting CETSA features between different cell lines. Last but not least, we perform experiments using PPI prediction with CETSA features predicted by CycleDNN, which achieve similar performance compared to experimental CETSA features.
Our future research will focus on three key areas. Firstly, we aim to explore the potential utility of the encoded high-dimensional latent features in PPI prediction by comparing the performance of latent features extracted by CycleDNN with that of standard CETSA features. Secondly, we plan to extend the application of CycleDNN to different protein features. This will involve incorporating different types of features, such as protein amino acid sequences and structural attributes, into our model to enable the interconversion between different protein features. Lastly, a focal point will be the refinement of the network structure to improve the overall performance of our model and thereby expand its applicability in bioinformatics. Our goal is to develop a computational framework capable of seamlessly converting a broader range of protein data across different cell lines through a shared protein latent space.

Figure 2. An illustration of utilizing CycleDNN for the transference of CETSA features between cell line A and cell line B. Z and Z′ are kept nearly identical to establish a shared protein latent space.

Figure 3. Architecture of the encoder and decoder within CycleDNN.

Figure 5. The diagram of CycleDNN with n encoders, n decoders, and latent space Z for cell lines {C_1, C_2, ..., C_n}.

Figure 6. The distributions of the ground truth (left) and predictions (right) of PPI scores in cell line HEK293T.

Table 1. CycleDNN architecture: detailed parameters and characteristics of encoder and decoder layers.

Table 2. Performance of CycleDNN between cell line A (A375) and cell line B (HCT116).