AutoML-ID: automated machine learning model for intrusion detection using wireless sensor network

The growing popularity of explainable machine learning models, coupled with the rapidly increasing use of synthetic data, enables the development of a cost-efficient machine learning model for fast intrusion detection and prevention at frontier areas using Wireless Sensor Networks (WSNs). The performance of any explainable machine learning model is driven by its hyperparameters. Several approaches have been developed and implemented successfully for optimising or tuning these hyperparameters for skilful predictions. However, the major drawback of these techniques, including manual selection of the optimal hyperparameters, is that they depend heavily on the problem and demand application-specific expertise. In this paper, we introduce an Automated Machine Learning (AutoML) model that automatically selects the machine learning model (among support vector regression, Gaussian process regression, binary decision tree, bagging ensemble learning, boosting ensemble learning, kernel regression, and linear regression) and automates the hyperparameter optimisation, using Bayesian optimisation, for accurate prediction of the number of k-barriers for fast intrusion detection and prevention. To do so, we extracted four synthetic predictors, namely, the area of the region, the sensing range of the sensor, the transmission range of the sensor, and the number of sensors, using Monte Carlo simulation. We used 80% of the dataset to train the models and the remaining 20% to test the performance of the trained model. We found that Gaussian process regression performs exceptionally well and outperforms all the other considered explainable machine learning models, with a correlation coefficient R = 1, root mean square error RMSE = 0.007, and bias = −0.006. Further, we also tested the AutoML performance on a publicly available intrusion dataset and observed a similar performance. This study will help researchers accurately predict the required number of k-barriers for fast intrusion detection and prevention.


Scientific Reports | (2022) 12:9074 | https://doi.org/10.1038/s41598-022-13061-z | www.nature.com/scientificreports/

The machine learning methods discussed above involve manual selection of the best-performing algorithm, which may lead to biased results if the results are not compared with a benchmark algorithm. In addition, the hyperparameters associated with each algorithm are optimised differently. To solve this problem, in this paper, we introduce an automated machine learning (AutoML) model to automate the model selection and hyperparameter optimisation tasks. In doing so, we synthetically extracted potential predictors (i.e., the area of the region, the sensing range of the sensor, the transmission range of the sensor, and the number of sensors) through Monte Carlo simulation. We then evaluated the predictor importance and predictor sensitivity through the regression tree ensemble approach. Subsequently, we applied AutoML to the training dataset to obtain the best optimised model. We evaluated the performance of the best-performing algorithm on the testing data using R, RMSE, and bias as performance metrics.

Material and methods
Predictor generation. The quality of the prediction of a machine learning model depends on the quality of the predictors and the model hyperparameters 24. Predictors can be categorised as real or synthetic, based upon the dataset acquisition process. Real data are obtained by direct measurement with instruments or sensors. However, generating real data involves intensive cost and labour. In contrast, synthetic data can be obtained through mathematical rules, statistical models, and simulations 25. In comparison to real data, acquiring synthetic data is efficient and cost-effective. Consequently, the use of synthetic datasets to train machine learning models has increased over the past five years 21,26-29.
We adopted the synthetic method to extract the predictor datasets using Monte Carlo simulations. In doing so, we used the network simulator NS-2.35 to generate the entire dataset. A finite number of homogeneous sensor nodes (i.e., sensing, transmission, and computational capabilities are identical for each sensor) are deployed in a rectangular Region of Interest (RoI) according to a Gaussian distribution, also known as a normal distribution. The Gaussian distribution is considered in this study since it can improve intrusion detection capability and is preferred for realistic applications. In a Gaussian distributed network, the probability that a sensor node is located at a point (x, y) in reference to the deployed location ($x_0$, $y_0$) 30,31 is given by:

$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left(-\left(\frac{(x - x_0)^2}{2\sigma_x^2} + \frac{(y - y_0)^2}{2\sigma_y^2}\right)\right) \tag{1}$$

where $\sigma_x$ and $\sigma_y$ are the standard deviations of the x and y location coordinates, respectively.
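As an illustration of the deployment step, the following minimal sketch draws node positions from independent Gaussian x and y coordinates around a deployment point. It is not the NS-2.35 setup used in the study: the function name `deploy_gaussian`, the rejection of points that fall outside the rectangle, and all numeric values are assumptions made for the example.

```python
import random

def deploy_gaussian(n_nodes, x0, y0, sigma_x, sigma_y, width, height, seed=0):
    """Sample node positions around (x0, y0), keeping only points inside the RoI."""
    rng = random.Random(seed)  # seeded generator so the deployment is repeatable
    nodes = []
    while len(nodes) < n_nodes:
        x = rng.gauss(x0, sigma_x)
        y = rng.gauss(y0, sigma_y)
        # Assumption: samples that land outside the rectangle are discarded.
        if 0.0 <= x <= width and 0.0 <= y <= height:
            nodes.append((x, y))
    return nodes

# Hypothetical 100 x 50 region with the deployment point at its centre.
nodes = deploy_gaussian(n_nodes=100, x0=50.0, y0=25.0,
                        sigma_x=15.0, sigma_y=8.0,
                        width=100.0, height=50.0)
```

Repeating such a draw many times with different seeds is the essence of a Monte Carlo experiment; the study's actual simulation parameters are those listed in Table 1.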
To evaluate the performance of WSNs, we considered the Binary Sensing Model (BSM) 32, which is the most extensively used sensing range model. Each sensor ($S_i$) is assumed to have a sensing range ($R_s$) and is deployed at an arbitrary point $(x_i, y_i)$. As per the BSM, a target is detected by a sensor with 100% probability if the target lies within the sensing range of the sensor. Otherwise, the target detection probability is zero. This is represented mathematically as:

$$p(S_i, P) = \begin{cases} 1, & d(S_i, P) \le R_s \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where $d(S_i, P) = \sqrt{(x_i - x)^2 + (y_i - y)^2}$ is the Euclidean distance between $S_i$ and the target point $P(x, y)$. In addition, we considered that any two sensors can communicate if they satisfy the criterion $R_{tx} \ge 2R_s$, where $R_{tx}$ and $R_s$ represent the transmission range and the sensing range, respectively. A barrier is constructed by joining a cluster of sensor nodes across the RoI to detect the presence of intruders. Furthermore, to assure barrier coverage, it is required to identify a Barrier Path (BP) in the RoI. In this scenario, the sensor nodes detect each intruder crossing the path. Thus, to ensure guaranteed k-barrier coverage in the rectangular RoI, the number of required nodes is computed as $k = \lceil L / (2R_s) \rceil$, and the maximum number of BPs is computed as $BP_{max} = \lfloor N / k \rfloor$ 33, where L is the length of the rectangular RoI, $R_s$ is the sensing range of the nodes, and N is the number of sensor nodes. Table 1 lists the network parameters and their values that have been used to obtain the simulation results.
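The binary sensing rule and the two counting formulas above can be checked with a few lines of code. This is a sketch only: the function names and the example values (L = 100, R_s = 10, N = 57) are hypothetical and are not taken from Table 1.

```python
import math

def detects(sensor, target, r_s):
    """Binary sensing model: 1 if the target is within range r_s of the sensor, else 0."""
    distance = math.hypot(sensor[0] - target[0], sensor[1] - target[1])
    return 1 if distance <= r_s else 0

def nodes_per_barrier(length, r_s):
    """k = ceil(L / (2 * R_s)): nodes needed to span a region of length L."""
    return math.ceil(length / (2.0 * r_s))

def max_barrier_paths(n_nodes, length, r_s):
    """BP_max = floor(N / k): upper bound on the number of barrier paths."""
    return n_nodes // nodes_per_barrier(length, r_s)

# Hypothetical values for illustration only.
k = nodes_per_barrier(length=100.0, r_s=10.0)                    # 5 nodes per barrier
bp_max = max_barrier_paths(n_nodes=57, length=100.0, r_s=10.0)   # 11 barrier paths
```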
Relative predictor importance. In machine learning, the choice of input predictors has substantial control over model performance 28. Predictor importance analysis is not restricted to any particular representations, techniques, or measures and can be used in any situation where predictive models are required. It expresses how significant a predictor was for the model's predictive performance, irrespective of the structure (linear or nonlinear) or the direction of the predictor effect. We calculated the relevancy of the selected predictors in estimating the k-barriers by estimating each predictor's relative importance score. To do so, we used the regression tree ensemble technique 21,34, a built-in tree-based method that assigns a relative score to every predictor or attribute of the data. The higher the score, the more important the predictor. Initially, we trained a regression tree ensemble model by boosting one hundred regression trees (i.e., t = 100), each with a learning rate of one (i.e., δ = 1), using the Least Squares gradient Boosting (LSBoost) ensemble aggregation method. Boosting an ensemble of regression algorithms has several advantages, such as handling missing data, representing nonlinear patterns, and yielding better generalisation when weak learners are combined into a single meta learner. In addition, the LSBoost ensemble minimises the mean square error by combining individual regression trees, often known as weak learners. The LSBoost technique trains weak learners on the training data set, fitting the residual errors and detecting the model's weak points. Based on these weak points, it generates a new weak learner ($l_i$) during every iteration and evaluates its weight ($\omega_i$) in order to reduce the difference between the response value and the aggregated predicted value, hence increasing prediction accuracy. Finally, the algorithm updates the current model ($M_i$) by emphasising the weak points of the prior model ($M_{i-1}$) according to Eq. (3):

$$M_i(X) = M_{i-1}(X) + \delta\,\omega_i\,l_i(X) \tag{3}$$

It then integrates the weak learner into the existing model after training and iteratively generates a single strong learner ($M_n$, i.e., an ensemble of weak learners).
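The residual-fitting loop described above can be sketched in plain Python. This is a simplified illustration, not the ensemble routine used in the study: it boosts one-split regression stumps on a single predictor, starts from the mean of the response (an assumption), and folds the learner weight into the stump's leaf values. The defaults mirror the reported settings of 100 learners and a learning rate of one.

```python
def fit_stump(x, residuals):
    """Least-squares stump on one predictor: best threshold and the two leaf means."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # the largest value would leave the right leaf empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def ls_boost(x, y, n_learners=100, learning_rate=1.0):
    """Build a predictor by repeatedly fitting stumps to the current residuals."""
    base = sum(y) / len(y)  # assumption: the initial model is the mean response
    learners = []
    predictions = [base] * len(y)
    for _ in range(n_learners):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)  # new weak learner targets the residuals
        learners.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in learners)

# Toy data (hypothetical): a nonlinear response of one predictor.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [xi ** 2 for xi in x]
model = ls_boost(x, y)
```

Each pass shrinks the training residuals, which is why a long sequence of weak stumps can approximate a smooth nonlinear response.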
To explore the predictor importance further, we estimated coefficients indicating the relative importance of each predictor within the trained model by computing the total change in the node risk ($\Delta R$) due to splits on each predictor and then normalising it by the total number of branch nodes ($R_{BN}$). This is mathematically represented as:

$$\Delta R = \frac{R_P - (R_{CH1} + R_{CH2})}{R_{BN}} \tag{4}$$

where $R_P$ indicates the node risk of the parent and $R_{CH1}$ and $R_{CH2}$ indicate the node risks of the two children. The node risk at an individual node ($R_i$) is mathematically represented as in Eq. (5):

$$R_i = P_i \cdot E_i \tag{5}$$

where $P_i$ denotes the probability of node i and $E_i$ denotes the mean square error of node i.
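A worked example makes the node-risk bookkeeping concrete. The sketch below assumes that node risk is the node probability multiplied by the node mean square error and that the importance contribution of a split is the parent risk minus the summed child risks, divided by the number of branch nodes. The numbers are invented for illustration.

```python
def node_risk(node_probability, node_mse):
    """Risk of a node: probability of reaching it times its mean square error."""
    return node_probability * node_mse

def split_importance(parent, child_1, child_2, n_branch_nodes):
    """Risk reduction of one split, normalised by the number of branch nodes.

    Each node is given as a (probability, mse) pair.
    """
    r_parent = node_risk(*parent)
    r_children = node_risk(*child_1) + node_risk(*child_2)
    return (r_parent - r_children) / n_branch_nodes

# Hypothetical split: the root (probability 1, MSE 4) is divided into two
# equally likely children with MSE 1 and MSE 2, in a tree with one branch node.
gain = split_importance(parent=(1.0, 4.0), child_1=(0.5, 1.0),
                        child_2=(0.5, 2.0), n_branch_nodes=1)
# gain = (4.0 - (0.5 + 1.0)) / 1 = 2.5
```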
Predictor sensitivity. We performed the sensitivity analysis of the predictors using the Partial Dependence Plot (PDP) 21,35. A PDP depicts how a model's predicted response (outcome) changes as a single explanatory variable varies. These plots have the advantage of exhibiting the form of the relationship that exists between the variable and the response 36. Moreover, a PDP depicts the marginal effect of one or more variables on the predicted response of the model 37. In this study, we considered the combined impact of two predictors simultaneously from the input predictor set (i.e., $\upsilon$) on the predictand by marginalising the impact of the remaining predictors. To accomplish this, a subset $\upsilon_s$ and its complementary set $\upsilon_c$ are extracted from the predictor set $\upsilon = \{z_1, z_2, \ldots, z_n\}$, where n represents the total number of predictors. Any prediction on $\upsilon$ is determined by Eq. (6), and the partial dependence of the predictors in $\upsilon_s$ is inferred by computing the expectation ($E_c$) of Eq. (6):

$$f(\upsilon) = f(\upsilon_s, \upsilon_c) \tag{6}$$

$$f_s(\upsilon_s) = E_c[f(\upsilon_s, \upsilon_c)] = \int f(\upsilon_s, \upsilon_c)\,\rho_c(\upsilon_c)\,d\upsilon_c \tag{7}$$

where f denotes the model prediction and $\rho_c(\upsilon_c)$ indicates the marginal probability of $\upsilon_c$, which is represented in Eq. (8):

$$\rho_c(\upsilon_c) = \int \rho(\upsilon_s, \upsilon_c)\,d\upsilon_s \tag{8}$$

Then, the partial dependence of the predictors in $\upsilon_s$ can be determined by:

$$f_s(\upsilon_s) \approx \frac{1}{U} \sum_{i=1}^{U} f(\upsilon_s, \upsilon_c^{(i)}) \tag{9}$$

where $\upsilon_c^{(i)}$ is the value of the complementary predictors in the i-th observation and U represents the total number of observations.

Automated machine learning model. AutoML automates steps of the machine learning process such as data pre-processing, predictor or feature engineering, best algorithm selection, and hyperparameter optimisation 38-40. Over the past few years, it has been widely used in industry and academia to solve real and near-real-time problems 41-43. In this study, we first standardised the predictors using Z-score scaling 44. Afterward, we divided the complete dataset randomly, using the Mersenne Twister (MT) random generator, in an 80:20 ratio for training and testing the AutoML model. The dimension of the complete dataset is 182 × 5, where 182 is the number of observations and 5 is the number of columns: the four predictors (i.e., the area of the region, the sensing range of the sensor, the transmission range of the sensor, and the number of sensors) and the response variable (i.e., k-barrier).
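The pre-processing described above can be sketched as follows. The code is illustrative and does not reproduce the study's scripts: the predictor values are random placeholders, the helper names are hypothetical, and the seed is arbitrary. Python's built-in `random` module uses the Mersenne Twister generator, which keeps the example close to the described procedure.

```python
import random
import statistics

def z_score_columns(rows):
    """Standardise each predictor column to zero mean and unit sample variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def split_80_20(rows, seed=42):
    """Randomly split rows into 80% training and 20% testing subsets."""
    rng = random.Random(seed)  # Mersenne Twister under the hood
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = int(0.8 * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

# Placeholder data: 182 observations of four predictors (values are made up).
data_rng = random.Random(1)
predictors = [[data_rng.uniform(0.0, 100.0) for _ in range(4)] for _ in range(182)]

scaled = z_score_columns(predictors)
train, test = split_80_20(scaled)
```

With 182 observations, truncating 0.8 × 182 = 145.6 gives 145 training rows and leaves 37 testing rows.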
The dimension of the training dataset is 145 × 5, and the dimension of the testing dataset is 37 × 5. After data division, we automated the algorithm selection and hyperparameter optimisation steps and investigated their performance. Various explainable machine learning models participate in the algorithm selection process; they are discussed in the following subsections.

Support vector regression model. The Support Vector Regression (SVR) model was introduced by Vapnik et al. 45 and was developed primarily from Support Vector Machine (SVM) classifiers. The SVR model has the benefit of being able to optimise the nominal margin using regression task analysis and is a popular choice for prediction and curve fitting for both linear and nonlinear regression types 46. The relationship between the input and output variables for nonlinear mapping 47 is determined by:

$$y = w^{T}\varphi(p) + q \tag{10}$$

where $p = (p_1, p_2, \ldots, p_n)$ indicates the input, $y_i \in \mathbb{R}^l$ indicates the output, $w \in \mathbb{R}^n$ indicates the weight vector, $q \in \mathbb{R}$ indicates the constant, n indicates the number of training data points, and $\varphi(p)$ indicates a nonlinear function that maps the input to the predictor space. To determine w and q, Eq.
In the SVR model, the three basic hyperparameters are the insensitive loss function ($\epsilon$), which specifies the tolerance margin; the capacity parameter, penalty coefficient, or box constraint (C), which specifies the error weight; and the Gaussian width parameter or kernel scale ($\gamma$) 48,49. A high value of C lets SVR memorise the training data. A smaller $\epsilon$ value implies noiseless data. The $\gamma$ value, in turn, is responsible for under-fitting or over-fitting of the prediction. Mathematically, the kernel is represented as:

$$K(p_i, p_j) = \exp\left(-\gamma \lVert p_i - p_j \rVert^2\right) \tag{12}$$

where K represents the kernel function and $\gamma$ represents the kernel scale, which controls how strongly variation in the predictors influences variation in the kernel.

Gaussian process regression model. Gaussian Process Regression (GPR), also known as kriging 50, is based on Bayesian theory 51 and is used to solve complex regression problems (high dimension, nonlinearity). It supports adaptive acquisition of hyperparameters, is easy to implement, and incurs no loss of performance. The fundamental and most extensively used GPR comprises a simple zero mean and a squared exponential covariance function 52, as represented in Eq. (13):

$$f(x) \sim \mathcal{GP}\left(0, k(x, x')\right) \tag{13}$$

where

$$k(x, x') = \varpi_f^2 \exp\left(-\frac{\lVert x - x' \rVert^2}{2g^2}\right) \tag{14}$$

and $k(x, x')$ represents the covariance function, or kernel, that provides the expected correlation among observations. The GPR model has two hyperparameters, the model noise ($\varpi_f$) and the length scale (g), which regulate the vertical scale and the horizontal scale of the function change, respectively.
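To show how a zero-mean Gaussian process with a squared exponential covariance produces predictions, the sketch below computes the posterior mean for a one-dimensional toy problem. It assumes the standard squared exponential form with a vertical scale and a length scale, adds a small noise term on the diagonal for numerical stability, and uses a hand-written linear solver to stay dependency-free. The function names, default values, and data are illustrative and are not the study's fitted model.

```python
import math

def sq_exp_kernel(x1, x2, signal_std, length_scale):
    """Squared exponential covariance between two scalar inputs."""
    return signal_std ** 2 * math.exp(-((x1 - x2) ** 2) / (2.0 * length_scale ** 2))

def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def gpr_predict_mean(x_train, y_train, x_new,
                     signal_std=1.0, length_scale=1.0, noise_std=1e-3):
    """Posterior mean of a zero-mean GP at x_new: k_*^T (K + noise^2 I)^-1 y."""
    n = len(x_train)
    k = [[sq_exp_kernel(x_train[i], x_train[j], signal_std, length_scale)
          + (noise_std ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(k, y_train)
    return sum(sq_exp_kernel(x_new, xi, signal_std, length_scale) * ai
               for xi, ai in zip(x_train, alpha))

# Toy data (hypothetical).
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [0.0, 1.0, 4.0, 9.0]
prediction = gpr_predict_mean(x_train, y_train, 2.0)
```

With a very small noise term the posterior mean nearly interpolates the training points, so the prediction at x = 2 is close to the observed value of 4.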
Binary decision tree regression. A Binary Decision Tree (BDT) regression is formed by performing consecutive recursive binary splits on variables of the form $y_i \le v$, $y_i \ge v$, where $v \in \mathbb{R}$ are observed values in a binary regression tree 53, which is represented as: where $T(y)$ indicates the regression tree, M indicates the number of the tree's terminal nodes, and $B_m(y)$ indicates the base function, which is determined by: Subject to: where $L_m$ indicates the total number of splits, $y_i$ indicates the involved variable, and $v_{im}$ indicates the splitting value. Moreover, the decision tree keeps establishing rules until the number of samples in a leaf falls below a specified size, i.e., the minimum leaf (min-leaf) size 54. Since the min-leaf size defines when splitting must terminate, it is considered a vital parameter that must be fine-tuned.
Ensemble regression model. Perrone and Cooper 55 proposed a general conceptual framework for obtaining considerably better regression estimates using ensemble methods. Ensemble Learning (EL) enhances performance by building and combining several base learners with specific approaches. It is mainly used when there is a limited amount of training data. It is challenging to choose a suitable classifier with this limited available data. Ensemble algorithms minimise the risk of selecting a poor classifier by averaging the votes of individual classifiers. This study has applied bagging and boosting EL methods due to their widespread usage and effectiveness for building ensemble learning algorithms.
Bagging (Breiman 56,57), also known as bootstrap aggregation and closely related to Random Forest (RF), is one of the most prominent approaches for building ensembles. It uses a bootstrap sampling technique to generate multiple different training sets. The base learners are then trained on every training set and combined to create the final model. Bagging works for a regression problem as follows. Consider a training set S that comprises the data $\{(X_i, Y_i), i = 1, 2, \ldots, m\}$, where $X_i$ and $Y_i$ represent the realisation of a multidimensional estimator and a real-valued variable, respectively. A predictor P(Y|X = x) = f(x) 58 is represented as: At first, create a bootstrapped sample Eq. (18) based on the empirical distribution of the pairs $S_i = (X_i, Y_i)$; next, using the plug-in concept, estimate the bootstrapped predictor as shown in Eq. (19). Finally, the bagged estimator is represented by Eq. (20).
Moreover, the three hyperparameters used in bagging are the MinLeafSize (minimum number of observations per leaf), NumVariablesToSample (number of predictors to sample at every node), and the NumLearningCycles (number of trees). The first two parameters determine the tree's structure, while tuning the final parameter helps balance efficiency and accuracy.
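The bootstrap-then-average idea behind bagging can be sketched briefly. To keep the example short, the base learner here is an ordinary least-squares line rather than a regression tree, which is a deliberate substitution; the function names, the number of learners, and the data are all illustrative assumptions.

```python
import random

def fit_line(points):
    """Ordinary least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0.0:  # degenerate resample in which every x is identical
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def bagged_predictor(points, n_learners=50, seed=0):
    """Train one base learner per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_learners):
        sample = [rng.choice(points) for _ in points]  # draw with replacement
        models.append(fit_line(sample))
    return lambda x: sum(a * x + b for a, b in models) / len(models)

# Toy data (hypothetical): a line with a small alternating perturbation.
points = [(float(x), 2.0 * x + 1.0 + (0.1 if x % 2 == 0 else -0.1))
          for x in range(10)]
predict = bagged_predictor(points)
estimate = predict(4.5)
```

Averaging over resamples reduces the variance of an unstable base learner, which is why the technique pays off most with deep trees; with tree learners, the leaf size and the number of sampled predictors would be tuned as described above.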
Boosting (Freund 59) is another ensemble method that aims to boost the efficiency of a given learning algorithm. The Least-Squares Boosting (LSBoost) ensemble method is used in this study because it is suited to regression and forecasting problems. LSBoost aims to reduce the Mean Squared Error (MSE) between the target variable (Y) and the weak learners' aggregated prediction ($Y_p$). First, the median of the target variable Y is computed. Next, to enhance the model accuracy, several regression trees ($r_1, r_2, \ldots, r_m$) are integrated in a weighted manner. Individual regression trees are determined by the following predictor variables (X) 60: where $w_m$ represents the weight for the m-th model, d represents the weak learners, and $\eta$, with $0 < \eta \le 1$, represents the learning rate.
Kernel regression model. Kernel regression (Nadaraya 61) is a widely used non-parametric method and is also known as a univariate kernel smoother. To perform a kernel regression, a collection of kernels is placed locally at every observational point. The kernel assigns a weight to every location depending on its distance from the observational point. A multivariate kernel regression 62 determines how the response parameter, $y_i$, depends on the explanatory parameter, $x_i$, as in Eqs. (22) and (23), where $E[\psi_i] = \mathrm{Cov}[m(x_i), \psi_i] = 0$, $m(\cdot)$ represents a non-linear function, and $\psi_i$ is random with mean zero and variance $\sigma^2$. It describes the way that $y_i$ varies around its mean, $m(x_i)$. The mean can be represented in terms of the probability density function f:

Linear regression model. The linear regression model relates the dependent variable to the independent variables as:

$$Y = a + \sum_{i=1}^{n} b_i X_i + u \tag{25}$$

where Y represents the dependent variable, $X_1, X_2, \ldots, X_n$ represent the n independent variables, a and b represent the regression coefficients, and u represents the stochastic disturbance term that could be caused by an undefined independent variable.
Bayesian optimisation. Bayesian Optimisation (BO) 64,65 is an efficient approach for addressing optimisation problems characterised by expensive experiments. It keeps track of the previous observations and forms a probabilistic mapping (or model) between the hyperparameters and a probabilistic score on the objective function that is to be optimised. This probabilistic model is known as a surrogate of the objective function. The surrogate function is much easier to optimise, and with the help of the acquisition function, the next set of hyperparameters is selected for evaluation on the actual objective function based on its best performance on the surrogate function. Hence, BO comprises a surrogate function for approximating the objective function and an acquisition function for sampling the next observation. In BO, the objective function (f) is obtained from the Gaussian Process (GP) as described in Eq. (26).
where µ and ϑ are calculated from the observations of x 66 . We select the best performing algorithm among the above-discussed models with the optimised hyperparameter. Lastly, we evaluated the performance of the best-performing algorithm using the test dataset. A flowchart of the detailed methodology is illustrated in Fig. 1.
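The overall selection loop can be outlined in code. The sketch below is a simplified stand-in rather than the study's procedure: it replaces Bayesian optimisation with plain random search, uses two toy model families (a k-nearest-neighbour averager and a Gaussian kernel smoother) instead of the seven regressors listed earlier, and assumes the objective is log(1 + valLoss), with valLoss taken as the 5-fold cross-validated mean square error. All names and data are hypothetical.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the responses of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def kernel_predict(train, x, bandwidth):
    """Gaussian-weighted average of the training responses around x."""
    weights = [math.exp(-((x - xi) ** 2) / (2.0 * bandwidth ** 2)) for xi, _ in train]
    total = sum(weights)
    if total == 0.0:  # every weight underflowed; fall back to the training mean
        return sum(y for _, y in train) / len(train)
    return sum(w * y for w, (_, y) in zip(weights, train)) / total

# Each candidate: (prediction function, sampler for its single hyperparameter).
CANDIDATES = {
    "knn": (knn_predict, lambda rng: rng.randint(1, 10)),
    "kernel": (kernel_predict, lambda rng: 10 ** rng.uniform(-1.0, 1.0)),
}

def cv_mse(predict, hyper, data, n_folds=5):
    """Mean square error over interleaved cross-validation folds."""
    errors = []
    for fold in range(n_folds):
        test = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        errors.extend((predict(train, x, hyper) - y) ** 2 for x, y in test)
    return sum(errors) / len(errors)

def select_model(data, n_iterations=120, seed=0):
    """Random search over models and hyperparameters, keeping the lowest objective."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iterations):
        name = rng.choice(sorted(CANDIDATES))
        predict, sample_hyper = CANDIDATES[name]
        hyper = sample_hyper(rng)
        objective = math.log(1.0 + cv_mse(predict, hyper, data))  # assumed objective
        if best is None or objective < best[0]:
            best = (objective, name, hyper)
    return best

# Toy data (hypothetical): a smooth one-dimensional response.
data = [(x / 10.0, math.sin(x / 10.0)) for x in range(60)]
best_objective, best_name, best_hyper = select_model(data)
```

A Bayesian optimiser would differ only in how the next model and hyperparameter are proposed: instead of sampling at random, it would consult the surrogate and the acquisition function described above.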

Results
Predictor importance and sensitivity. We plotted the relative predictor importance score of each predictor along with their respective box plot for a better visual representation of the datasets (Fig. 2). We found that the relative predictor importance score ranges approximately from 9 to 152. The higher the value of the relative estimate, the more relevant is the predictor in estimating the response variable (i.e., k-barriers). We found that out of these four predictors, the transmission range of the sensor emerges as the most relevant predictor in predicting the required number of k-barriers for fast intrusion detection and prevention considering Gaussian node distribution over a rectangular region. The number of sensors also shows good relevancy in predicting the response variable and ranked second. The area of the region of interest and the sensing range of the sensor shows fair relevancy and ranked third and fourth, respectively.
We also evaluated the impact of each predictor on the response variable. We plotted the partial dependence plot for each possible pair of predictors (Fig. 3a-f). For a better visual inspection, we also plotted the www.nature.com/scientificreports/ three-dimensional plot and its two-dimensional illustration. We observed that the area of the RoI has a slightly negative impact on the target variable i.e., the response variable decreases with an increase in the area of the RoI. However, an inverse relationship is observed with all other predictors. The sensing range of the sensor, the transmission range of the sensor, and the number of the sensors have a positive impact on the response variable i.e., the response variable increases with an increase in these predictors.
Model performance. We iteratively selected the best machine learning model with optimised hyperparameter values using Bayesian optimisation 67-69 on 80% of the dataset (Fig. 4). We used Eq. (27) as the objective function (Obj) to select the best machine learning model with optimised hyperparameters:

$$Obj = \log(1 + valLoss) \tag{27}$$

where valLoss is the cross-validation mean square error (CV-MSE). At each iteration, the value of the objective function is computed for one of the participating models. The model (with optimised hyperparameters) that returns the minimum observed loss (i.e., the smallest value of the objective function so far) is considered the best model. After 120 iterations, the AutoML algorithm returned the GPR model as the best model along with the optimal hyperparameters (i.e., for the GPR model, sigma = 0.98). Before returning the model, the AutoML algorithm retrains the GPR model on the entire training dataset. Once we obtained the trained GPR model, we evaluated its performance on the training dataset to estimate the training accuracy. We found that the model performed well on the training dataset, with a correlation coefficient R = 1, root mean square error RMSE = 0.003, and bias = 0. However, for an unbiased evaluation, we evaluated the performance of the trained model on the test dataset (i.e., 20% of the total dataset). In doing so, we fed the testing predictors into the trained GPR model and obtained the predicted response. We then compared the GPR-predicted k-barriers with the observed values (Fig. 5a). We found that the GPR model performs exceptionally well, with R = 1, RMSE = 0.007, and bias = −0.006. All the data points are aligned along the regression line and lie well inside the 95% Confidence Interval (CI).
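The three performance metrics can be computed as in the sketch below. It assumes that bias is the mean of predicted minus observed values, which matches the reading used in this paper that a positive bias indicates overestimation. The sample values are invented and are not results from the study.

```python
import math

def evaluate(observed, predicted):
    """Return (R, RMSE, bias) for paired observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(var_o * var_p)  # Pearson correlation coefficient
    rmse = math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)
    bias = sum(p - o for o, p in zip(observed, predicted)) / n  # > 0 means overestimation
    return r, rmse, bias

# Hypothetical values: every prediction is one unit too high.
r, rmse, bias = evaluate([10.0, 20.0, 30.0, 40.0], [11.0, 21.0, 31.0, 41.0])
```

The example illustrates why the three metrics are reported together: a constant offset leaves R at 1 while RMSE and bias both equal the offset.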
Further, to assess the appropriateness of the plotted linear regression plot, we performed residual analysis. We plotted the time series of the observed and the predicted values along with the corresponding residual values (Fig. 5b). We found that the residuals are significantly low and do not follow any pattern, which indicates a good linear fit.
To understand the distribution of the error (i.e., the difference between predicted and observed values), we performed an error analysis using an error histogram with ten bins (Fig. 6). The error ranges from −0.00997 on the left to 0.00356 on the right of the histogram plot. We found that the error follows a right-skewed Gaussian distribution. The peak of the distribution lies in the underestimated region. Lastly, we present the results of the remaining AutoML algorithms (i.e., SVR, BDT, bagging ensemble learning, boosting ensemble learning, kernel regression, and linear regression) in Table 2. We found that the best-performing AutoML algorithm (i.e., GPR) outperforms all the other algorithms.

Discussion
We observed that the AutoML approach successfully selects the best machine learning model among a group of explainable machine learning algorithms (i.e., SVR, GPR, BDT, bagging ensemble learning, boosting ensemble learning, kernel regression, and linear regression) and optimises its hyperparameters. For an unbiased and fair evaluation of the proposed approach, we also compared the AutoML-derived results with benchmark algorithms. We selected the Feed-Forward Neural Network (FFNN) 70, Recurrent Neural Network (RNN) 71, Radial Basis Neural Network (RBN) 72, Exact RBN 73, and Generalised Regression Neural Network (GRNN) 74 as the benchmark algorithms. We selected these algorithms because they are frequently used in diverse applications such as remote sensing, blockchain, cancer diagnosis, precision medicine, disease prediction, self-driving cars, streamflow forecasting, and speech recognition, and hence have high generalisation capabilities 37,75-77. We trained these algorithms on the same datasets. We found that AutoML outperforms all the deep learning benchmark algorithms (Table 3). Among the benchmark algorithms, GRNN performs the best (with R = 0.97, RMSE = 64.61, bias = 60.18, and computation time t = 2.23 s). Surprisingly, all the benchmark algorithms have a high positive bias value, which indicates that these models strongly overestimate the number of required k-barriers. We also compared the performance of AutoML with previous studies 21,22 on the prediction of k-barriers and k-barrier coverage probability (Table 4). Further, we tested the performance of the AutoML approach on the publicly available intrusion detection dataset 22. In a recent study, Singh et al. 22 proposed a log-transformed feature scaling based algorithm (i.e., LT-FS-ID) for intrusion detection considering a uniform node distribution scenario.
We downloaded the datasets and applied the proposed AutoML approach to them. In doing so, we ran AutoML for 120 iterations using Bayesian optimisation to obtain the best optimised machine learning model. We found that the AutoML approach performs well on the dataset (with R = 0.92, RMSE = 30.59, and bias = 18.13). Interestingly, the same GPR algorithm emerges as the best learner, with an optimised sigma = 0.33. This highlights the potential of the GPR algorithm for intrusion detection, which is also apparent from the recently published literature 21,78.
The proposed AutoML approach for estimating the k-barriers for fast intrusion detection and prevention is highly user-friendly and provides a fast solution. It removes the uncertainty of selecting the best-performing algorithm by automating the process. Further, it also overcomes a limitation of the LT-FS-ID algorithm 22, which only works if the input predictors are positive real numbers and fails if any input predictor contains zero or negative values. Although the AutoML approach gives the best result, its performance will degrade as the sensors age. In other words, with the ageing effect in the sensors, the quality of the data recorded by the sensors may change drastically (i.e., the datasets become dynamic), resulting in performance degradation. In such a situation, retraining the proposed model will solve the problem.
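The input restriction mentioned above for log-based scaling can be illustrated with a short sketch. The `log_scale` helper is a plain natural-log transform written for this example and is not the LT-FS-ID algorithm itself; the readings are made up. Z-score scaling, as used in this study, has no such restriction.

```python
import math
import statistics

def log_scale(values):
    """Plain log transform; math.log raises ValueError for zero or negative input."""
    return [math.log(v) for v in values]

def z_score(values):
    """Z-score scaling; defined for any real values with non-zero spread."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

readings = [0.0, 5.0, 10.0]  # hypothetical predictor values containing a zero

try:
    log_scale(readings)
    log_ok = True
except ValueError:
    log_ok = False  # the zero reading cannot be log-transformed

scaled = z_score(readings)  # [-1.0, 0.0, 1.0]
```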

Conclusion
In this study, we proposed a robust AutoML approach to accurately estimate the number of k-barriers required for fast intrusion detection and prevention using WSNs over a rectangular RoI, considering a Gaussian distribution of the node deployment. We found that the synthetic predictors (i.e., the area of the RoI, the sensing range of the sensor node, the transmission range of the sensor node, and the number of sensors) extracted through Monte Carlo simulations mapped successfully to the k-barriers. Among these predictors, the transmission range of the sensor emerges as the most relevant predictor, and the sensing range of the sensor emerges as the least relevant predictor. In addition, we observed that only the area of the RoI has a slightly negative impact on the response variable. We then iteratively ran the AutoML algorithms to obtain the best machine learning model among the explainable machine learning models using Bayesian optimisation. We found that the AutoML algorithm selects the GPR algorithm as the best machine learning model to map the required k-barriers accurately. We evaluated the potential of the GPR algorithm on unseen test datasets and found that the AutoML-selected algorithm performs exceptionally well on them. We further compared the AutoML results with the benchmark algorithms for a more reliable and robust conclusion and found that AutoML outperforms all the benchmark algorithms in terms of accuracy. To generalise this approach further, we tested the efficacy of AutoML on the publicly available datasets on intrusion detection using WSNs and found a similar performance. This study is a step towards a cost-efficient approach for fast intrusion detection and prevention using explainable machine learning models.

Data availability
The datasets generated during and/or analysed during the current study can be made available from the corresponding author on a reasonable request.

Code availability
The computer algorithms originated during the current study can be made available from the corresponding author on a reasonable request.