Driving forces of digital transformation in Chinese enterprises based on machine learning

With advances in science and digital technology, digital transformation has become an important way to promote the sustainable development of enterprises. However, existing research focuses mainly on the linear relationship between a single characteristic and digital transformation. In this study, we use data on Chinese A-share listed companies from 2010 to 2020 and apply machine learning methods, based on the Technology-Organization-Environment (TOE) theory, to explore differences in the predictive effects of multi-dimensional features on the digital transformation of enterprises, thereby identifying the main drivers of digital transformation and the fitted models with the strongest predictive performance. The findings are as follows. First, a comparison of machine learning with traditional linear regression models shows that the predictive ability of ensemble learning methods is generally higher than that of traditional econometric methods. For the sample data selected in this research, XGBoost and LightGBM have strong explanatory ability and high prediction accuracy. Second, compared with the technical and environmental driving forces, the organizational driving force has a greater impact. Third, among the individual characteristics, equity concentration and executives' knowledge level in the organizational dimension have the greatest impact on digital transformation. Enterprise managers should therefore pay close attention to the role of equity concentration and executives' knowledge level in decision-making. This study enriches the literature on digital transformation in enterprises, expands the application of machine learning in economics, and provides a theoretical basis for enterprises seeking to enhance digital transformation.

in natural sciences than in social sciences, the powerful learning ability and self-correcting ability of machine learning are very suitable for the quantitative analysis of the causal relationships among variables in the economic field. As more scholars study and update machine learning algorithms, machine learning models have gained advantages in analysis speed, accuracy and comprehensiveness of results 25,26, and their application to the digital transformation of enterprises has begun to thrive. This study reviews the application of machine learning in fields related to enterprise digital transformation, summarized as follows. (1) Akbari et al. 27 used Random Forest Regression to study the driving factors of economic and financial integration, concluding that integration is a gradual process. In addition, combining Random Forest Regression with evidence theory can effectively improve the efficiency of enterprise financial risk early warning 28. (2) Kamalov et al. 29 used Logistic Regression (LR), Random Forest Regression (RFR), Multilayer Perceptron (MLP) and Long Short-Term Memory (LSTM) to compare how effective stock prices and stock returns are in predicting stock movements, finding that forecasting stock prices is more advantageous. (3) Nazareth and Reddy 30 examined the performance of machine learning in stock market forecasting, portfolio management, cryptocurrency, foreign exchange markets, financial crisis prediction, and bankruptcy and insolvency prediction; 31 also used machine learning models to explore how well financial indicators forecast returns in the Chinese stock market. (4) The study of 32 confirmed that machine learning has a stronger early warning ability for economic crises than traditional logit and ensemble models. Samitas et al. 33 also used machine learning as an early warning system for financial crises. (5) Achakzai and Peng 34 developed a new machine learning model, Dynamic Ensemble Selection (DES), to detect fraud in financial statements. (6) Murugan 35 used cluster-based XGBoost and cluster-based K-nearest neighbor (KNN) to analyze financial risk. (7) Mashrur et al. 36 stated that machine learning can predict the probability of default of individuals or enterprises by identifying loan applicants and enterprises with similar characteristics.

The motivation for digital transformation
The core of digital transformation is to use digital technology to improve the existing organizational mode of enterprise management, close the "data gap" between different departments of the enterprise, and redesign the production and operation structure and the management mode, so as to improve the efficiency of resource allocation and innovate in management 37. By studying the driving factors, enterprises can understand the internal and external environment they face in digital transformation and thus carry it out more effectively.
In recent years, many domestic and foreign scholars have discussed the antecedents of enterprise digital transformation from the perspectives of environment, organization and management. Existing studies identify multiple dimensions of motivation for the digital transformation of enterprises. (1) Technical motivation. Digital skills directly or indirectly affect digital transformation 38. Investment in IT alone cannot produce the expected results. To have a positive impact on digital transformation, IT infrastructure must be combined with other capabilities of the company to develop relevant transformation strategies 39.
(2) Organizational motivation. Both digital strategy and organizational ability have positive effects on the digital transformation of enterprises 40,41. (3) Manager motivation. Compared with other factors such as technology, managers' awareness is the biggest obstacle to digital transformation 42,43. In addition, Hu et al. 44 concluded that the overseas education and work experience of senior executives are positively correlated with the level of digital transformation of enterprises. (4) The motivation of the digital economy. Li et al. 45 argued that the digital economy can help enterprises obtain the key elements of digital transformation; digital financial inclusion can also significantly improve the digital transformation of enterprises 46. (5) The motivation of intergenerational inheritance. The intergenerational inheritance of family businesses promotes digital transformation to some extent, but its inhibitory effect is greater than its incentive effect 47. (6) Enterprise internal factors. In addition to enterprise size 48, enterprise resources, enterprise capabilities and entrepreneurial spirit also affect digital transformation 49. (7) Operating environment motivation. Luo et al. 50 found that the business environment can promote the digital transformation of enterprises by attracting high-tech talent and increasing technology investment. (8) Policy motivation. Wang et al. 51 found that government support, including government subsidies and tax incentives, has a positive influence on the digital transformation of enterprises by alleviating financing constraints, increasing R&D investment and improving risk-bearing capacity. Moreover, climate policy 52 and low-carbon strategy 53 also influence the digital transformation of enterprises. (9) Human capital motivation. Enterprise digitization not only involves upgrading digitization-related hardware assets, but also requires the "software" support of employees' knowledge and skills 54. (10) Huang et al. 55 considered that changes in consumer behavior, together with the experience of leading enterprises in an industry that have achieved their own transformation by building digital platforms, continually encourage other enterprises to embark on transformation. The degree of industry competition 56 and the development level of regional big data 57 are also key factors affecting the digital transformation of enterprises. However, the above studies are mainly based on a particular feature of a single dimension. They lack a comprehensive consideration and comparative analysis of the motivations for digital transformation, and their conclusions are difficult to apply to the whole sample. To address the interaction and configuration effects among dimensions, the indicators of each dimension can be classified and discussed. After comparing the similarities and differences among the characteristics of different motivations, this study applies TOE theory 16, which divides the driving factors of digital transformation into technical motivation, organizational motivation and environmental motivation. Technical motivation is an important support for enterprise digital transformation and incorporates the innovation ability and absorption ability of the enterprise; organizational motivation focuses on the internal governance and structure of the enterprise; environmental motivation is mainly reflected in government regulation and the market environment. This classification allows the motivations for enterprise digital transformation to be discussed more comprehensively, with the aim of identifying the key drivers of enterprise digital transformation.
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n + \epsilon \tag{1}$$
In Formula 1, $y$ is the dependent variable, $x_1, x_2, \ldots, x_n$ are the independent variables, $\theta_0, \theta_1, \ldots, \theta_n$ are the model parameters and $\epsilon$ is an error term. The goal of linear regression is to estimate the model parameters by minimizing the mean squared error (MSE), as shown in Formula 2.
$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 \tag{2}$$
Here, $m$ is the number of samples, $y^{(i)}$ is the true value of the $i$-th sample, and $\hat{y}^{(i)}$ is the predicted value of the $i$-th sample. Once the regression coefficients have been estimated, the dependent variable can be predicted for new values of the independent variables, and the relative importance of the different independent variables to the dependent variable can be evaluated.
Secondly, LASSO regression. LASSO regression is an improvement on linear regression that adds an L1 regularization term to the squared-error objective being minimized, as shown in Formula 3.
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \alpha \sum_{j=1}^{n} \left| \theta_j \right| \tag{3}$$
Here, $\alpha$ is a regularization parameter used to control the complexity of the model, and $\theta_j$ is a model parameter other than the intercept term. The purpose of LASSO regression is to prevent overfitting and improve the generalization ability of the model by penalizing large parameter values.
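To make the shrinkage effect of the L1 penalty concrete, the following is a minimal sketch under simplifying assumptions that are not taken from the study: a single feature, no intercept, and the objective written as the mean squared error plus $\alpha|\theta|$. In that special case the minimizer has a closed form given by soft-thresholding. The function names `soft_threshold` and `lasso_1d` and the toy data are illustrative only.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; return 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_1d(x, y, alpha):
    """Minimize (1/m) * sum((y_i - theta * x_i)^2) + alpha * |theta|.

    Assumes one feature and no intercept (an illustrative simplification).
    """
    m = len(x)
    a = 2.0 * sum(xi * xi for xi in x) / m               # curvature term
    b = 2.0 * sum(xi * yi for xi, yi in zip(x, y)) / m   # correlation term
    return soft_threshold(b, alpha) / a


# Toy data generated by y = 2x (hypothetical, not from the paper's sample).
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

theta_ols = lasso_1d(x, y, 0.0)            # no penalty: ordinary least squares
theta_shrunk = lasso_1d(x, y, 28.0 / 3.0)  # moderate penalty: coefficient shrinks
theta_zero = lasso_1d(x, y, 100.0)         # strong penalty: coefficient is set to zero
```

With no penalty the sketch recovers the least-squares slope; as $\alpha$ grows the coefficient shrinks and is eventually set exactly to zero, which is how LASSO discards weak predictors.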
Thirdly, Gradient Boosted Regression Trees (GBR). The gradient boosted regression tree is an ensemble learning method based on tree models. It generates multiple trees over successive iterations and then takes a weighted sum of the predictions of these trees to obtain the final predicted value. The objective function of the gradient boosting decision tree is Formula 4.
$$\mathrm{Obj} = \sum_{i} l\left( y_i, \hat{y}_i \right) + \sum_{k=1}^{K} \Omega\left( f_k \right) \tag{4}$$
Here, $l$ is the loss function used to measure the difference between the true and predicted values, $\Omega$ is the regularization term used to control the complexity of the tree, $f_k$ is the function expression of the $k$-th tree, and $K$ is the number of trees. The advantage of gradient boosting decision trees is that they optimize the loss function through gradient boosting and can handle different types of loss functions, such as squared loss, absolute loss and logarithmic loss. The parameters of gradient boosting decision trees can be estimated through methods such as gradient boosting or Newton boosting.
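The iterative procedure described above can be sketched in a few lines. This is a deliberately simplified illustration, not the implementation used in the study: it assumes a single feature, squared loss, depth-one trees (stumps), a fixed learning rate and no regularization term. The names `fit_stump`, `boost` and `mse` and the toy data are hypothetical.

```python
def fit_stump(x, r):
    """Fit a one-split regression stump to targets r on a single feature x.

    Assumes x contains at least two distinct values.
    """
    best = None
    for t in sorted(set(x))[:-1]:  # candidate thresholds, excluding the maximum
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda v: lm if v <= t else rm


def boost(x, y, n_trees=50, lr=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    trees, pred = [], [base] * len(y)
    for _ in range(n_trees):
        # For squared loss the negative gradient is the residual y - prediction.
        resid = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, resid)
        trees.append(tree)
        pred = [pi + lr * tree(xi) for xi, pi in zip(x, pred)]
    return lambda v: base + lr * sum(t(v) for t in trees)


def mse(y, p):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)


# Toy non-linear data (hypothetical).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [v * v for v in x]

model_1 = boost(x, y, n_trees=1)
model_50 = boost(x, y, n_trees=50)
mse_1 = mse(y, [model_1(v) for v in x])
mse_50 = mse(y, [model_50(v) for v in x])
```

The final prediction is the base value plus a weighted sum of the stumps, so adding trees lowers the training error; in practice a regularization term and out-of-sample evaluation are needed to keep this from overfitting.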
Fourthly, Random Forest (RFR). Random forest is an ensemble learning method based on tree models. It generates multiple decision trees through repeated random sampling and then averages the predictions of these trees, or takes a vote among them, to obtain the final predicted value. The objective function of a random forest is Formula 5, in which $l$, $\Omega$, $f_k$ and $K$ have the same meaning as in GBR. The advantage of random forest is that it can improve the efficiency and effectiveness of the model through techniques such as parallel computing, bootstrap sampling and random feature selection. It can also handle problems such as missing values and categorical features. The parameters of random forests can be estimated through methods such as bootstrapping or extremely randomized trees.
Fifth, XGBoost. XGBoost is an ensemble learning algorithm based on gradient boosting trees that can be used for both regression and classification problems. Firstly, it uses an optimization strategy called Extreme Gradient Boosting, which can build and train models in parallel on multi-core CPUs, greatly improving computational speed and efficiency. Secondly, it adds a regularization term, which controls the complexity and overfitting risk of the model. The regularization term includes the number of leaf nodes in the tree and the sum of the squares of the weights of the leaf nodes (the score values of the leaf nodes). The loss function is
$$L(\phi) = \sum_{i} l\left( \hat{y}_i, y_i \right) + \sum_{k} \Omega\left( f_k \right)$$
where $L(\phi)$ represents the loss function, $\hat{y}_i$ represents the predicted value of the $i$-th sample at the current iteration (the current tree), $y_i$ represents the true value, and $\Omega(f_k)$ represents the regularization term.
Sixth, LightGBM. LightGBM is a machine learning method based on the Gradient Boosting Decision Tree (GBDT). It has the following characteristics. It supports categorical features and can directly process numerical and categorical data without one-hot encoding. It supports histogram optimization, which reduces the number of passes over the full data set and speeds up decision tree construction. Gradient-based One-Side Sampling (GOSS) retains samples with large gradients and down-samples those with small gradients, reducing the amount of data used in each iteration and improving the generalization ability of the model. Exclusive Feature Bundling (EFB) combines mutually exclusive features, that is, features that rarely take non-zero values at the same time, into one feature to reduce the feature dimension and the amount of computation. Leaf-wise tree growth with a depth limit is supported to avoid over-fitting and premature convergence. The loss function value of each sample at each leaf node is formulated as follows, where $n$ is the number of training samples, $m$ is the number of categories, $x_i$ is the feature vector of the $i$-th sample, $y_i$ is the category label of the $i$-th sample, $\gamma$ is the weight coefficient, and $f(x)$ is the predicted value.
In summary, ensemble learning methods effectively compensate for endogeneity and other shortcomings that arise when non-linear relationships and interactions between variables are forced into a linear specification, and thus perform well in out-of-sample prediction tasks 58. Therefore, ensemble learning methods should predict the intensity of enterprise digital transformation better than linear methods such as multiple linear regression.

Model setting
To select a more effective prediction model, model performance is assessed in terms of explanatory power and prediction error. For explanatory power, following the existing literature 29, this study adopts three indicators. (1) In-sample goodness of fit $R^2_{IS}$. This index evaluates how well the machine learning model fits the training data; the higher the in-sample goodness of fit, the stronger the explanatory ability of the model. (2) Out-of-sample goodness of fit $R^2_{oos}$. Because the in-sample goodness of fit cannot fully reflect how the model generalizes to new data, this article further uses the out-of-sample goodness of fit $R^2_{oos}$ to measure the generality of the model. (3) Explained variance $EVS_{oos}$. This indicator measures the share of the variability of the dependent variable that the model explains. It is based on the variance of the difference between the predicted and observed values, and thus gauges the generalization ability of the model from the perspective of variance.
For prediction error, following existing research 59,60, the out-of-sample mean squared error $MSE_{oos}$ is selected to measure the deviation between the predicted and actual values. If the model performs well on the training data but has a high mean squared error on the test data, it may be overfitting, that is, failing to adapt well to new data. Calculating the out-of-sample mean squared error therefore allows the study to evaluate model performance more comprehensively and to determine whether the model generalizes well. To limit the influence of extreme values, the mean absolute error $MAE_{oos}$ and the median absolute error $MedAE_{oos}$ are also reported as more robust measures of prediction accuracy. The meanings and calculations of the evaluation indicators are shown in Table 1.
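As an illustration of the indicators named above, the following sketch computes them with the Python standard library, using the usual textbook definitions ($R^2 = 1 - SS_{res}/SS_{tot}$, explained variance $= 1 - \mathrm{Var}(y - \hat{y})/\mathrm{Var}(y)$). It is assumed, not confirmed by the text, that these match the formulas in Table 1; the function name `regression_metrics` and the example values are hypothetical. Applying the function to training data gives the in-sample figures, and applying it to held-out data gives the out-of-sample figures.

```python
from statistics import mean, median, pvariance


def regression_metrics(y_true, y_pred):
    """Return R^2, explained variance, MSE, MAE and median absolute error."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    y_bar = mean(y_true)
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "evs": 1.0 - pvariance(errors) / pvariance(y_true),
        "mse": ss_res / len(errors),
        "mae": mean(abs_errors),
        "medae": median(abs_errors),
    }


# Hypothetical observed and predicted values for a small test set.
y_test = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]
metrics = regression_metrics(y_test, y_hat)
```

Explained variance differs from $R^2$ only when the prediction errors have a non-zero mean, and the median absolute error is unaffected by a few very large errors, which is why it complements the mean-based measures.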
Moreover, one of the main advantages of ensemble learning is that the weaknesses of a single model can be reduced by combining multiple base models; as a consequence, however, the results are difficult to interpret in terms of any single learner. This study therefore uses relative importance and partial dependence graphs to compensate for this deficiency and to interpret the practical significance of the ensemble learning results. First, relative importance refers to the relative contribution or influence of each factor on the outcome during model fitting. Following the practice of Nazareth and Reddy 30, with the rest of the model held constant, the relative importance of a variable is obtained by measuring the decrease in the loss function caused by adding that variable to the model. The greater the relative importance, the stronger the ability of the variable to predict the intensity of the digital transformation of enterprises. Second, the partial dependence graph measures the influence of a given variable on the digital transformation of an enterprise while the other features are held unchanged, and displays it graphically so that the pattern is easier to see. It also allows the effect of a single variable on the predicted degree of enterprise digital transformation to be read more precisely 61.
Table 1. Model evaluation indicators and calculation methods.

Evaluation indicator | Indicator meaning | Computational formula
$R^2_{IS}$ | In-sample goodness of fit: how closely the model's predicted values match the actual observed values in the training set |

Variable definition
This study selects the Digital Transformation Index (Digitaltransindex) in the CSMAR database as the response variable. According to the CSMAR definition, the response variable is constructed from word-frequency statistics of terms related to enterprise digital transformation in annual reports, covering five parts: artificial intelligence (AI), blockchain (BC), cloud computing (CC), big data (BD) and the application of digital technology (ADT). This measure effectively reflects the degree of enterprise digital transformation; the detailed calculation is given in the variable table.
According to the TOE theoretical framework and existing research on the driving forces of enterprise digital transformation, this study selects the driving force characteristics of the model from the following three dimensions. For the technical dimension, this study follows Tamayo et al. 38 and selects the intensity of R&D expenses and the technical size as the measures of innovation ability and absorption ability. For the organizational dimension, referring to Li et al. 57, Schoar and Zuo 62, Chen et al. 63 and Bandiera et al. 64, the study selects ten variables to measure the organizational drive characteristics: senior management team size (Manager Number), senior executives' knowledge level (Education Level), executives' social capital (Social Network), profitability (ROA), growth (Growth), enterprise value (TobinQ), solvency (Lev), equity concentration (Top Ten Holders Rate), duality of chairman and general manager (Duality), and proportion of independent directors (IndDirector Ratio). For the environmental dimension, referring to the research of Li et al. 49, Luo et al. 50 and Wu and Wang 65, financial support (Financial Support), infrastructure index (Infrastructure Score), monetary policy easing (Monetary Policy), intellectual property protection level (IP Protection) and industry competition pressure (HhiD) are taken as the variables measuring the environmental characteristics of companies.
In addition, for the benchmark variable group, referring to Li et al. 57,66, Zhao et al. 67 and Hanelt et al. 68, we include past performance (Past Revenue), cash flow ratio (Cash Flow Ratio), enterprise age (Firm Age), enterprise size (Size) and ownership (SOE). The variables are also listed in Table 3.

Descriptive statistics
According to Table 4, the mean of Digitaltransindex is 37.7564 and the standard deviation is 11.8132, which indicates that the degree of digital transformation differs considerably across enterprises. The other variables show no outliers, which supports the reasonableness of the prediction exercise.

The fitting results of the model based on the enterprise digital transformation index prediction
Table 5 lists the prediction results of the models constructed by the different methods for the degree of enterprise digital transformation. The results in Column (1) show that the in-sample goodness of fit $R^2_{IS}$ of multiple linear regression, the LASSO model and GBR is lower than 0.54 in every case, whereas that of RFR, XGBoost and LightGBM is higher than 0.9 in every case; XGBoost reaches 0.9867, indicating that these ensemble learning methods achieve a better in-sample fit. In addition, columns (2) and (3) of Table 5 show that LightGBM has the highest out-of-sample goodness of fit $R^2_{oos}$ and explained variance $EVS_{oos}$, at 0.7350 and 0.7353 respectively, followed by XGBoost; all four values for the two methods are higher than 0.72. This illustrates that ensemble learning methods can better predict the degree of digital transformation of enterprises. As can be seen from column (4), the out-of-sample mean squared errors $MSE_{oos}$ of XGBoost and LightGBM are smaller than those of the other four methods. Finally, columns (5) and (6) show that XGBoost and LightGBM have lower mean absolute errors $MAE_{oos}$ (5.3023 and 5.2542) and lower median absolute errors $MedAE_{oos}$ than the linear regression method. This indicates that the results change little once the influence of extreme values is reduced. In summary, among the ensemble learning methods, XGBoost and LightGBM fit the data better, so that a research model with more accurate predictions can be constructed. The following sections further discuss the driving forces and key factors of enterprise digital transformation.

Differences in the driving force dimensions of enterprises' digital transformation prediction ability
To explore the differences in the ability of different driving forces to predict the intensity of enterprise digital transformation, this study follows Chen et al. 63 and uses as the benchmark model past performance (Past Revenue), cash flow ratio (Cash Flow Ratio), enterprise age (Firm Age), enterprise size (Firm Size) and ownership (SOE). Then, following Bertomeu et al. 69, we calculate and compare the predictive performance obtained when different combinations of the TOE dimensions are added to the benchmark model. Because the conclusions obtained from the different evaluation indicators are basically the same, this study analyzes the out-of-sample goodness of fit $R^2_{oos}$; the results are shown in Table 6. Firstly, the predictive ability of each single driving force dimension for the intensity of enterprise digital transformation is considered separately. As shown in Table 6, the prediction effect is best when the organizational features are added to the benchmark model. Taking LightGBM as an example, the out-of-sample goodness of fit of the model improves to 0.7073, 0.7111 and 0.6583 after adding the characteristics of the technical, organizational and environmental driving forces to the benchmark model, respectively. Secondly, considering combinations of two different types of motivation and comparing the out-of-sample goodness of fit among the groups in Table 6, we find that models whose combination includes the organizational driving force fit best. Finally, when all three driving forces are integrated, LightGBM has the strongest explanatory power, followed by XGBoost. According to these prediction results, enterprises need to pay attention to improving organizational driving forces, such as the shareholding ratio of the top ten shareholders and the knowledge level of the top management team. At the same time, enterprises need to pay attention to changes in the external business environment, so as to seize the opportunities offered by favorable policies and increase the intensity of digital transformation. The following section analyzes in detail the differences among single factors based on LightGBM and XGBoost and puts forward more specific suggestions for enterprises. Table 7 shows the top 15 variables by relative importance in the LightGBM and XGBoost prediction methods, which indicates that these characteristics are the key factors affecting the digital transformation of Chinese companies.
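The nested comparison described in this section, in which the benchmark features are extended by every combination of the three TOE groups, can be enumerated as in the following sketch. The grouping logic follows the text, but the feature lists are abbreviated and the column labels (for example "Technical Size") are illustrative assumptions rather than the exact variable names of the study. Each resulting feature set would then be passed to the same model and scored by out-of-sample goodness of fit.

```python
from itertools import combinations

# Benchmark variables named in the text.
BENCHMARK = ["Past Revenue", "Cash Flow Ratio", "Firm Age", "Firm Size", "SOE"]

# Abbreviated, illustrative feature lists for each TOE dimension.
GROUPS = {
    "technical": ["R&D Expenses", "Technical Size"],
    "organizational": ["Education Level", "Top Ten Holders Rate"],
    "environmental": ["HhiD", "IP Protection"],
}


def feature_sets():
    """Return the benchmark set plus one set per non-empty combination of groups."""
    sets = {"benchmark": list(BENCHMARK)}
    for r in range(1, len(GROUPS) + 1):
        for combo in combinations(GROUPS, r):
            features = list(BENCHMARK)
            for group in combo:
                features += GROUPS[group]
            sets[" + ".join(combo)] = features
    return sets


all_sets = feature_sets()
```

With three groups this yields the benchmark model, three single-dimension models, three two-dimension models and one full model, eight specifications in total.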

Prediction model of the intensity of digital transformation of enterprises by important driving factors
Based on the relative importance and ranking of the variables in Figs. 1 and 2 and Table 7, this study selects innovation ability (R&D Expenses), equity concentration (Top Ten Holders Rate), executive knowledge level (Education Level), degree of industry competition (HhiD) and past performance (Past Revenue). These variables have the highest relative importance within the technical, organizational, environmental and benchmark dimensions respectively, and have a stronger impact on the digital transformation of enterprises. They are also of general significance for the digital transformation of companies in different industries. Figure 3 shows the partial dependence on R&D expenses. This research uses the R&D investment ratio of enterprises as the proxy variable for innovation capability. As shown in the figure, once the R&D investment of an enterprise exceeds 10%, the degree of digital transformation shows a fluctuating upward trend as the proportion of investment increases, and reaches its peak when R&D investment is about 42%. When R&D investment exceeds 45%, the degree of transformation remains at a high level and tends to flatten. R&D investment has the highest relative importance in the technical dimension, indicating that it plays the strongest driving role in the process of digital transformation. Therefore, managers should attach great importance to innovation, avoid blindly increasing R&D expenses, and adjust the process of digital transformation in a timely manner.
Figure 4 shows the partial dependence diagram of equity concentration. This paper uses the shareholding ratio of the top ten shareholders as the proxy variable. Overall, the curve fluctuates considerably, but it still shows a negative trend. When the ratio is around 40%, the degree of transformation is relatively high, and it declines significantly after the ratio reaches 57%. This shows that high equity concentration is not conducive to digital transformation, which is also related to the principal-agent problem within the enterprise. To promote digital transformation, innovation and sustainable development, enterprises can introduce more shareholders and stakeholders so that more reasonable decisions are made.
Figure 5 shows the partial dependence diagram of executives' knowledge level, which is calculated by assigning scores to senior executives' education levels and weighting them. As shown in Fig. 5, the general trend is that the higher the knowledge level of management, the higher the degree of digital transformation. In particular, the degree of digital transformation rises steeply when the knowledge level reaches 2.7 and then increases gradually. After peaking at around 3.6, it begins to decline rapidly. As decision-makers, senior executives with a higher education level are better able to accept and implement innovation strategies. They also possess professional knowledge and leadership, and can lead the enterprise team to maintain smooth operation in technology research and development, operations and management. Therefore, enterprises should recruit more highly educated talent, optimize the composition of the top management team, further improve its overall quality and ability, and lay a solid foundation for digital transformation.
Figure 6 shows the partial dependence diagram of industry competitive pressure; the proxy variable is the Herfindahl index of the industry in which the enterprise operates. The higher the Herfindahl index, the higher the market concentration and the lower the level of competition. As shown in the figure, the relationship between the digital transformation of enterprises and the competitive pressure of the industry is difficult to describe with a simple linear relationship. When the Herfindahl index is around 0.02, the degree of digital transformation is highest. It then drops sharply and remains relatively stable in the range of 0.05-0.10, with a small peak. After the index reaches 0.18, the intensity of digital transformation continues to decline. In general, the greater the competitive pressure in the industry, the higher the degree of digital transformation. Therefore, enterprises in highly competitive industries need to monitor the market environment closely, strengthen the implementation of their digital transformation strategy, and establish competitive advantages. Figure 7 shows the partial dependence diagram of the past performance of the enterprise, which uses the natural logarithm of the company's operating income at the end of the year as the proxy variable. As shown in Fig. 7, the past performance of enterprises shows a positive trend. Once it reaches 21.5, the positive impact of past performance on digital transformation gradually becomes larger, accompanied by small peaks. Therefore, the annual operating income of the enterprise positively promotes its digital transformation, and the slope of this effect increases beyond a certain value. As a benchmark variable, past performance also ranks high in relative importance among all variables, which demonstrates its general relevance. Enterprises should first pay attention to their main business and secure the funds and operational capacity needed for digital transformation, so that digital reform can be carried out in line with business conditions and the two can reinforce each other.
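The partial dependence curves discussed in this section can be computed with a simple averaging procedure, sketched below. This is a generic illustration, not the code used in the study: `predict` stands for any fitted model's prediction function (for example a trained LightGBM or XGBoost model), and the function name `partial_dependence`, the toy model and the data are hypothetical.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction over the data as one feature is set to each grid value.

    predict       : callable taking one feature row (list) and returning a number
    rows          : list of feature rows from the sample
    feature_index : position of the feature of interest within each row
    grid          : values of that feature at which to evaluate the curve
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)              # copy so the sample is not altered
            modified[feature_index] = value   # fix the feature, keep the others
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


# Hypothetical stand-in for a fitted model: prediction = 2 * x0 + x1.
def toy_predict(row):
    return 2.0 * row[0] + row[1]


sample = [[0.0, 1.0], [0.0, 3.0]]
pd_curve = partial_dependence(toy_predict, sample, feature_index=0, grid=[0.0, 1.0, 2.0])
```

Because the other features keep their observed values, the curve shows the average predicted outcome at each level of the chosen feature, which is what the figures plot against R&D expenses, equity concentration and the other selected variables.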

Robustness test
First, we change the method of dividing the training set. In the main test of this study, the training set and test set are determined by a random 8:2 split, so the results depend to some extent on the particular random division. To evaluate the performance and generalization ability of the model more accurately, K-fold cross-validation is used in place of this single split. The basic principle of K-fold cross-validation is to divide the original data set into K subsets of similar size, of which K-1 subsets are used as the training data and the remaining subset is used as the validation data. The procedure is repeated K times, with a different subset selected as the validation data each time, resulting in K performance evaluations. The average of these results is usually taken as the final performance index of the model. The advantage is that a limited dataset is fully utilized and the variance of the evaluation results is reduced. Through repeated validation and averaging, the performance of the model on different subsets of the data can be evaluated more accurately, the evaluation bias caused by a specific division is reduced, and the evaluation results are more reliable. In this study, the original dataset is divided into K subsets of similar size, with K set to 10. Repeating the process in this way yields more stable evaluation results and reduces the contingency caused by different data divisions. For small data sets, K-fold cross-validation also evaluates model performance better, reducing the overfitting or underfitting issues caused by a lack of data. As shown in Table 8, after replacing the training and test sets with K-fold cross-validation, the findings are unchanged relative to Table 5.
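The K-fold procedure described above can be sketched with the standard library alone. The splitting and averaging logic follows the description in the text, with K = 10 as in the study; everything else is an illustrative assumption, including the function names `kfold_splits` and `cross_val_mse`, the fixed random seed, the toy data, and the trivial mean predictor `fit_mean` that stands in for an actual model.

```python
import random
from statistics import mean


def kfold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of similar size
    for i in range(k):
        train = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train, folds[i]


def cross_val_mse(x, y, fit, k=10, seed=0):
    """Train on k-1 folds, score MSE on the held-out fold, and average."""
    scores = []
    for train, valid in kfold_splits(len(y), k, seed):
        predict = fit([x[j] for j in train], [y[j] for j in train])
        scores.append(mean([(y[j] - predict(x[j])) ** 2 for j in valid]))
    return mean(scores), scores


def fit_mean(x_train, y_train):
    """Hypothetical placeholder model: always predicts the training mean."""
    y_bar = mean(y_train)
    return lambda v: y_bar


# Toy data (hypothetical).
x = [float(v) for v in range(50)]
y = [2.0 * v + 1.0 for v in x]

avg_mse, fold_mse = cross_val_mse(x, y, fit_mean, k=10)
splits = list(kfold_splits(len(y), k=10))
```

Every observation is used for validation exactly once and for training in the other K-1 rounds, which is why the averaged score depends less on any single random division than an 8:2 split does.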
Second, we change the measurement indicator of the intensity of digital transformation. To rule out outliers or other sources of uncertainty, this study replaces the indicator used to measure the intensity of digital transformation in enterprises. Following Xiao et al. 54, we use a different set of entries: we remove the application-level entry "digital technology application" and keep only the entries at the underlying digital technology level, namely "artificial intelligence", "blockchain technology", "cloud computing" and "big data technology". We add 1 to the total frequency and take the natural logarithm as the new response variable. The model is then re-trained and evaluated using the new response variable. The test results are shown in Table 9; they are consistent with the main test, indicating that the model in this study is robust.
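The construction of the alternative response variable can be illustrated with a short sketch. The term strings, the dictionary layout, and the function name are assumptions made for illustration; only the rule itself (sum the frequencies of the four underlying-technology entries, add 1, take the natural logarithm) comes from the text.

```python
import math

# The four underlying-technology entries retained in the robustness test.
# The exact strings are illustrative; the source dictionary may label them differently.
BASE_TECH_TERMS = (
    "artificial intelligence",
    "blockchain technology",
    "cloud computing",
    "big data technology",
)


def transformation_intensity(term_counts):
    """ln(1 + total frequency of the retained entries) for one firm-year."""
    total = sum(term_counts.get(term, 0) for term in BASE_TECH_TERMS)
    return math.log(total + 1)


# Hypothetical firm-year word counts; the application-level entry is ignored.
counts = {
    "artificial intelligence": 3,
    "cloud computing": 5,
    "digital technology application": 12,
}
intensity = transformation_intensity(counts)  # ln(3 + 5 + 1)
```

Adding 1 before taking the logarithm keeps the measure defined (and equal to zero) for firm-years in which none of the entries appears.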

Discussion
A review of the existing literature shows that scholars mainly focus on the correlation between a single factor in a single dimension and the intensity of enterprise digital transformation, and make predictions only within the sample, lacking a comprehensive consideration of the driving forces of enterprise digital transformation. In this study, the driving force of enterprise digital transformation is divided into three dimensions: technical driving force, organizational driving force and environmental driving force. By combining and comparing the driving forces of two or three dimensions, the differences in the predictive ability of indicators in different dimensions are presented and the key driving factors are identified. Meanwhile, most existing studies use only traditional econometrics as a tool, which makes it difficult to avoid the interaction between factors and raises certain endogeneity issues. This study takes data on Chinese A-share listed companies from 2010 to 2020 as the sample, examines the driving forces of digital transformation in enterprises, and innovatively applies ensemble learning methods, which can improve the accuracy of model prediction and enhance generalization ability. Using relative importance rankings and partial dependence plots, and comparing the fitting effects of adding factors of different dimensions to the benchmark model, we find that technical factors can more effectively and accurately predict the digital transformation behavior of enterprises. This means that in the process of enterprises pursuing digital transformation, the technical driving force dominates. Compared with linear methods such as multiple linear regression, the ensemble learning methods achieve higher explanatory power and lower prediction error; among them, XGBoost has the best prediction performance on the samples used in this study. Among the many driving force characteristics, equity concentration and the knowledge level of executives in the organizational dimension, and innovation ability in the technical dimension, have the best prediction effect.
Based on the above conclusions, this study proposes the following policy suggestions: (1) Governments should provide enterprises with policy support, financial support, technical support, and cooperation opportunities. Financial and tax incentives can be offered to encourage enterprises to invest in building digital technology and information systems. Special funds can be set up to expand digital infrastructure such as network infrastructure, cloud computing centers and data centers. To enhance the operating performance of enterprises, governments can organize professional teams and partner institutions to train technical staff, and encourage higher education institutions, research institutions and others to participate in research and innovation on digital transformation.
(2) The senior management team of an enterprise should clarify the strategic goal and path of digital transformation. It should strengthen the reserve of high-level talent and reasonably adjust the proportion of technology research and development. As shown in Fig. 3, R&D investment of around 40% plays a greater role in promoting digital transformation. Enterprises should maintain this proportion as far as possible rather than investing in R&D blindly, so as to maximize the transformation effect. At the same time, enterprises should assess the risks in the process of digital transformation, take appropriate risk control and response measures, and pay attention to industry policy direction and enterprise value. They can take advantage of favorable economic conditions to plan the transformation. Performance management is important during the transformation. Enterprises should actively adjust and innovate their organizational structure, business processes and working modes, and first ensure the stable growth of the main business. They should then seize the opportunity to carry out digital technology research and development, implement the digital transformation strategy, and ensure sufficient funds and organizational stability throughout the transformation.
(3) Scholars should continue to follow the trend of digital transformation. They can write professional reports and application cases to provide valuable information and guidance for enterprises and governments, actively apply research results to practical scenarios, help enterprises solve practical problems, advance the process of digital transformation, and promote the mutual flow of knowledge and technology.
The limitations of this study are as follows. First, because the data are not randomly sampled but selected on the basis of availability, the sample may differ significantly from the industry and size distribution of China's A-share companies, which may affect the prediction performance of the fitted models. Second, the TOE framework cannot cover all relevant variables and driving factors; for example, differences in digital transformation modes caused by industry characteristics are not examined. A separate discussion of the degree of digital transformation in each industry will be one of our future research directions. Third, the machine learning methods used in this paper are all black-box algorithms. Despite the robustness tests, there remains a risk that the empirical results are biased by errors generated by the algorithms themselves. Therefore, other analysis methods could be combined with them to give a more comprehensive account of enterprise digital transformation.

Figure 4. Partial dependence on Top Ten Share Holder Rate.

Figure 5. Partial dependence on Education Level.
Scientific Reports | (2024) 14:6177 | https://doi.org/10.1038/s41598-024-56448-w | www.nature.com/scientificreports/

Data sources and variable definitions
Data source
In this study, A-share listed companies from 2010 to 2020, namely companies listed on the Shenzhen Stock Exchange and the Shanghai Stock Exchange of China, are taken as the initial sample. Company data are drawn from the Wind and CSMAR databases. To exclude the interference of some special observations with the prediction results, this study processes the data as follows: (1) enterprises with abnormal listing status such as ST and PT are excluded, to avoid interference with the overall prediction effect from the abnormal operation of the enterprise itself; (2) samples with seriously missing data are eliminated; (3) continuous variables are winsorized at the 1% and 99% quantiles to avoid the interference of extreme outliers. Finally, 8310 observations are obtained, and the yearly distribution of observations is shown in Table 2.

sample goodness of fit: in the training set, the fit of the model's predicted values to the actual observed values.
$\mathrm{EVS}_{oos}$: Explained variance; in the prediction set, the degree to which the predictions account for the variation in the actual observed values. $\mathrm{EVS}_{oos} = 1 - \dfrac{\operatorname{var}(y - \hat{y})}{\operatorname{var}(y)}$
$\mathrm{MSE}_{oos}$: Mean squared error; the expected value of the squared difference between the out-of-sample predicted value and the actual value. $\mathrm{MSE}_{oos} = \dfrac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$
$\mathrm{MAE}_{oos}$: Mean absolute error; the expected value of the absolute difference between the out-of-sample predicted value and the actual value. $\mathrm{MAE}_{oos} = \dfrac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$
$\mathrm{MedAE}_{oos}$: Median absolute error; the median of the absolute differences between the out-of-sample predicted and actual values. $\mathrm{MedAE}_{oos} = \operatorname{median}\left(\left|y_i - \hat{y}_i\right|\right)$
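The out-of-sample metrics EVS, MSE, MAE and MedAE defined in this table can be computed as in the following sketch. It assumes the population variance in the EVS ratio and uses made-up values for `y_true` and `y_pred`; the function names are illustrative and are not taken from the study's code.

```python
from statistics import mean, median, pvariance


def evs(y_true, y_pred):
    """Explained variance score: 1 - var(y - y_hat) / var(y)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def medae(y_true, y_pred):
    """Median absolute error."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


# Made-up test-set values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 5.0]

scores = {
    "EVS": evs(y_true, y_pred),
    "MSE": mse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "MedAE": medae(y_true, y_pred),
}
```

Because EVS subtracts the mean residual before measuring spread, a model with a constant bias can still score a high EVS, which is one reason to report it alongside MSE and MAE.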

Table 2. Yearly distribution of observations.

Table 3. Variable definition.
The education level of the senior executive team is measured as follows: other degrees are assigned a value of 1, a college degree 2, … 3, and a graduate degree 4. The sum of the values for the senior executive team is divided by the total number of members, and the resulting average represents the education level of the senior executive team.
Social network: Measured by the total number of senior executives working in other enterprises in the corresponding year.
Top ten holders Rate: Share ratio of the top ten shareholders.
(Market value of tradable shares + number of non-tradable shares × net assets per share + book value of liabilities)/total assets
Environmental
Financial support: The ratio of local financial expenditure on science and technology to public budget revenue.
Infrastructure score: The entropy weight method is used to combine the infrastructure application and development indicators that support the development of the digital economy into an infrastructure index; provincial annual data.
Monetary policy: The annual M2 growth rate for that year.
IP protection: The ratio of the contract amount of the technology market of each province to the GDP of that province in the current year; provincial annual data.
HhiD: The Herfindahl-Hirschman Index of the industry in which the enterprise operates.

Table 5. Results of model fitting.

Table 6. Prediction performance under different combinations of driving forces.

Differential analysis of the prediction ability of digital transformation by key factors under different driving forces
Based on the above analysis, XGBoost and LightGBM have the better prediction effect. Therefore, these two ensemble learning methods are applied to compare, by relative importance, the differences in the ability of different variables in the machine learning model to predict the intensity of enterprise digital transformation. Figures 1 and 2 report the ranking of relative importance of variables, and Table

Table 8. Test of robustness - Panel A.

Table 9. Test of robustness - Panel B.