Prediction of the Vaccine-derived Poliovirus Outbreak Incidence: A Hybrid Machine Learning Approach

Recently, significant attention has been devoted to vaccine-derived poliovirus (VDPV) surveillance due to its severe consequences. Prediction of the outbreak incidence of VDPV requires an accurate analysis of the alarming data. The overarching aim of this study is to develop a novel hybrid machine learning approach to identify the key parameters that dominate the outbreak incidence of VDPV. The proposed method is based on the integration of random vector functional link (RVFL) networks with a robust optimization algorithm called the whale optimization algorithm (WOA). WOA is applied to improve the accuracy of the RVFL network by finding suitable parameter configurations for the algorithm. The classification performance of the WOA-RVFL method is successfully validated using a number of datasets from the UCI machine learning repository. Thereafter, the method is implemented to track the VDPV outbreak incidences that recently occurred in several provinces of the Lao People’s Democratic Republic. The results demonstrate the accuracy and efficiency of the WOA-RVFL algorithm in detecting the VDPV outbreak incidences, as well as its superior performance to the traditional RVFL method.

www.nature.com/scientificreports
outbreaks significantly depends on gathering data from clinicians or laboratories and developing associated central information repositories. These are usually inefficient processes that might lead to further spread of disease [5][6][7][8][9]. Consequently, the important and yet to be solved issue related to PV surveillance is how to rapidly unveil outbreak incidences. A powerful solution to deal with this issue is machine learning (ML). Although ML has been increasingly utilized for solving complex real-world problems, its application in public health arguably needs more attention. In this context, ML methods have been successfully applied to public health problems such as the real-time detection of foodborne illness 10 , and syndromic surveillance that depends on the symptoms reported by patients 11,12 . Tessmer et al. 13 proposed various ML techniques such as artificial neural networks (ANN), convolutional neural networks (CNN), and long short-term memory (LSTM) to estimate the basic reproduction number. These methods were applied to epidemiological data from outbreaks of influenza A(H1N1) pdm09, mumps, and measles. Moreover, ML methods are used for syndromic surveillance based on the chief complaint field to detect disease outbreaks. For example, Lee et al. 14 compared two recurrent ANN models based on LSTM and gated recurrent unit (GRU) cells with multinomial naive Bayes (MNB) and support vector machine (SVM) models to improve syndromic surveillance. Volkova et al. 15 utilized ANNs to forecast the influenza-like illness dynamics for military populations. To the best of our knowledge, however, most of the machine learning prediction models in public health are based on ANNs and their extensions (e.g. 16,17 ).
Although the traditional ANN method is a powerful method for classification, clustering, and regression 18,19 , certain limitations are reported due to its basic structures, namely, the trapping in local minima and initialization process that involves assigning initial random values to the weights of the network 20 . Those limitations severely impede the applications of ANN-based methods in public health.
To overcome the critical issues in ANN, the random vector functional link network (RVFL) has been developed as a single-layer feed-forward neural network based on a randomized algorithm 21,22 . Thanks to the growing concept of randomization, the RVFL method considers the link between inputs and outputs and therefore effectively overcomes the limitations of traditional ANN algorithms. On this basis, the weights connecting the input and hidden layers are randomly generated and then kept fixed, while the output weights are computed using Moore-Penrose pseudo-inverse theory 23 . RVFL has also been reported to have other features, e.g., fast convergence 24 , good approximation capability 22 , and compatibility with real-time applications thanks to its simple hardware implementation 20 . Given its unique characteristics, RVFL has been used in several applications including remote sensing 25 , big data analytics 26 , forecasting temperature distribution 27 , short-term electricity load demand forecasting 28 , time-series data prediction 29 , language handwritten script recognition 30 , and semi-supervised learning 31 . However, the efficiency of RVFL is significantly affected by its parameters. Studies have been conducted to determine the influence of parameters on the RVFL's efficiency. Park et al. 17 concluded that the use of direct links between the input and output layers had a significant effect on the performance of RVFL. Additionally, the radbas function gave RVFL a higher ability to reach targets than the sign or hardlim activation functions 17 . Li et al. 32 investigated the relation between the domain of hidden parameters and the performance of RVFL and found that it was not suitable to generate hidden weights from a fixed domain such as [−1,1]. Zhang and Suganthan 33 conducted a comprehensive study to find the best parameters that enhance the performance of RVFL.
As with traditional ANNs, the process of randomly selecting RVFL network parameters typically leads to high complexity. As a swarm optimization algorithm that emulates the social behavior of whales attacking their prey 34 , the whale optimization algorithm (WOA) offers a powerful tool to address the problem of finding a suitable configuration for RVFL.
The classes include the IgG antibodies in children (n = 1216) and adults (n = 1228), including healthcare workers and blood donors. Antibody titers obtained by microneutralization in a subset of the classes show that 92% of the children class had anti-poliovirus antibodies. On the other hand, the antibody seroprevalences were 81.7% and 71.9% in adult blood donors and healthcare workers, respectively. Notably, both the children and adult classes showed neutralizing antibodies against one of the three poliovirus serotypes and had antibodies against all serotypes. These findings were compatible with the epidemiology of the outbreak [41].
The classification supports the medical field in optimizing the evaluation of vaccination schemes in diverse cohorts using the seroprevalence of poliovirus antibodies. It also helps sustain the value of ELISA in developed countries with a specific epidemiological nature. To date, ELISA has yielded an acceptable underestimation of the vaccination scheme in children; however, its sensitivity in adults is low. Thus, the classification paradigm supports ELISA as a reasonable alternative to microneutralization in the children classes. Using the classification model enables countries with uncertain vaccination schemes and limited resources not only to avoid the risk of outbreaks from poliovirus vaccines but also to prevent the re-importation of wild strains. Moreover, this will improve the use of ELISA in class-based studies to judge immunization programs.
In this study, we develop a hybrid ML paradigm by implementing WOA in RVFL to accurately track the immunity response to VDPV during outbreaks. In the hybrid WOA-RVFL method, the search domain for the parameters of RVFL (i.e., number of neurons, activation function, link between input and output) is first determined. Thereafter, a random population is generated in which each solution represents a configuration of the RVFL network. The solutions of the population are updated using the best solution and the operators of the WOA. The process of updating the solutions is repeated until the best configuration is obtained. The results show that the presented hybrid approach improves the performance of the RVFL algorithm for the prediction of the VDPV outbreak incidences.

Methods
In this section, basic details about the RVFL and WOA are briefly described followed by the description of the proposed hybrid WOA-RVFL method.
Random vector functional link networks. RVFL benefits from the properties of random weights and the functional link 27 . In general, the RVFL algorithm has the same structure as the single layer feedforward neural network (SLFNN) except for a direct connection between the input and output neurons. This type of connection improves the ability of RVFL to avoid overfitting. Figure 1 shows the structure of the RVFL network. The neurons at the input layer receive the dataset, and then each hidden (enhancement) neuron computes its output by:

$$O_j = g(a_j^{T} x + b_j), \quad j = 1, \dots, P \tag{1}$$

where $g$ is the activation function, $x$ is an input sample, and $b_j$ and $a_j$ are the bias and the weight between the input and enhancement neurons, respectively. $S$ represents a scale factor, which bounds the random values of $a_j$ and $b_j$ and is updated during the learning process for each dataset. The output of RVFL is computed using the output weight ($w \in \mathbb{R}^{n+P}$, for $n$ input features and $P$ enhancement neurons), defined as:

$$Z = Bw \tag{2}$$

where $B$ represents the input matrix to the output layer (i.e., the input data and the output of the enhancement neurons), and it is defined as:

$$B = [B_1 \;\; B_2] \tag{3}$$

with $B_1$ holding the input data and $B_2$ the outputs of the enhancement neurons. In order to update $w$ in Eq. (2), the Moore-Penrose pseudo-inverse or the ridge regression 27 can be used, defined respectively as:

$$w = B^{\dagger} Z \tag{4}$$

$$w = \left(B^{T} B + \frac{I}{C}\right)^{-1} B^{T} Z \tag{5}$$

where $I$ and $C$ are the identity matrix and the trade-off parameter, respectively. Note that $\dagger$ denotes the Moore-Penrose pseudo-inverse.
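The RVFL training procedure described above can be illustrated with a minimal NumPy sketch. It is not the authors' Matlab implementation: the function names, the sigmoid activation, the sampling ranges $a_j \in [-S, S]$ and $b_j \in [0, S]$, and the default values are assumptions chosen for illustration. The sketch draws the random enhancement weights once, concatenates the inputs with the enhancement outputs (the direct link), and solves for the output weights with either the pseudo-inverse or the ridge solution.

```python
import numpy as np


def design_matrix(X, a, b):
    """Build B = [inputs, enhancement outputs]; the inputs form the direct link."""
    H = 1.0 / (1.0 + np.exp(-(X @ a + b)))  # sigmoid activation (assumed)
    return np.hstack([X, H])


def train_rvfl(X, Z, n_hidden=20, scale=1.0, C=1e3, mode="ridge", seed=0):
    """Fit an RVFL network and return (a, b, w).

    a, b : random input-to-enhancement weights and biases, never trained.
    w    : output weights of shape (n_features + n_hidden, n_outputs).
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Assumed sampling ranges governed by the scale factor S.
    a = rng.uniform(-scale, scale, size=(n_features, n_hidden))
    b = rng.uniform(0.0, scale, size=n_hidden)
    B = design_matrix(X, a, b)
    if mode == "pinv":
        # Moore-Penrose pseudo-inverse solution.
        w = np.linalg.pinv(B) @ Z
    else:
        # Ridge regression solution with trade-off parameter C.
        w = np.linalg.solve(B.T @ B + np.eye(B.shape[1]) / C, B.T @ Z)
    return a, b, w


def predict_rvfl(X, a, b, w):
    """Return the network outputs; use argmax over columns for class labels."""
    return design_matrix(X, a, b) @ w
```

For classification, `Z` would typically hold one-hot encoded labels, and the predicted class is the column with the largest output.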
Whale optimization algorithm. WOA was proposed as a swarm algorithm to simulate the behaviors of whales during the process of attacking the prey 34 . This process can be described by two approaches, including (1) encircling and (2) bubble-net.
In the encircling approach, each whale ($x_i$) updates its location at the current iteration ($t$) based on the distance ($D_i$) to the prey ($x^{*}$) as:

$$x_i(t+1) = x^{*}(t) - A \odot D_i, \quad D_i = |B \odot x^{*}(t) - x_i(t)| \tag{6}$$

where $\odot$ is the element-wise multiplication, and the two coefficients $A$ and $B$ (the latter unrelated to the RVFL matrix $B$) are updated as

$$A = 2a \odot r - a, \quad \text{and} \quad B = 2r \tag{7}$$

In Eq. (7), the parameter $a$ is decreased from 2 to 0 as the iterations increase (i.e., $a = a - t\,\frac{a}{t_{max}}$, with $a$ on the right-hand side taken at its initial value and $t_{max}$ representing the maximum number of iterations). The value of $r$ is randomly generated in the $[0,1]$ interval.
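In the bubble-net method, the location of the whale $x_i$ is updated using a spiral, which simulates the helix-shaped movement of $x_i$ around $x^{*}$ 34 as:

$$x_i(t+1) = D'_i \, e^{bl} \cos(2\pi l) + x^{*}(t), \quad D'_i = |x^{*}(t) - x_i(t)| \tag{8}$$

where $b$ is a constant that determines the shape of the logarithmic spiral and $l$ is a random number in $[-1, 1]$. The whales can swim around the prey simultaneously using the spiral-shaped path and the shrinking circle, selected based on the probability $p \in [0, 1]$ as follows:

$$x_i(t+1) = \begin{cases} x^{*}(t) - A \odot D_i, & p < 0.5 \\ D'_i \, e^{bl} \cos(2\pi l) + x^{*}(t), & p \geq 0.5 \end{cases} \tag{9}$$

In addition, it is possible to update the location of each whale based on the location of a random whale $x_r$ as:

$$x_i(t+1) = x_r - A \odot D_i, \quad D_i = |B \odot x_r - x_i(t)| \tag{10}$$

The final steps of the traditional WOA can be summarized in Algorithm 1.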
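The WOA update rules above can be combined into a short, self-contained sketch. It is an illustration under stated assumptions rather than a transcription of Algorithm 1: the function name `woa_minimize` and its defaults are hypothetical, the coefficients are drawn as one scalar per whale for brevity, the switch between encircling the best whale and following a random whale uses the common |A| < 1 rule from the standard WOA formulation, and the best solution is updated greedily after each move.

```python
import numpy as np


def woa_minimize(fitness, lower, upper, n_whales=20, t_max=50, b=1.0, seed=0):
    """Minimize `fitness` inside box bounds with a basic WOA loop."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    # Random initial population inside the bounds.
    X = lower + rng.random((n_whales, dim)) * (upper - lower)
    fit = np.array([fitness(x) for x in X])
    best_idx = int(np.argmin(fit))
    best_x, best_f = X[best_idx].copy(), float(fit[best_idx])
    for t in range(t_max):
        a = 2.0 - 2.0 * t / t_max  # decreases linearly from 2 towards 0
        for i in range(n_whales):
            A = 2.0 * a * rng.random() - a
            B = 2.0 * rng.random()
            if rng.random() < 0.5:
                # Shrinking circle around the best whale (|A| < 1),
                # otherwise move relative to a randomly chosen whale.
                if abs(A) < 1.0:
                    target = best_x
                else:
                    target = X[rng.integers(n_whales)]
                D = np.abs(B * target - X[i])
                X[i] = target - A * D
            else:
                # Logarithmic spiral around the best whale.
                l = rng.uniform(-1.0, 1.0)
                D_prime = np.abs(best_x - X[i])
                X[i] = D_prime * np.exp(b * l) * np.cos(2.0 * np.pi * l) + best_x
            X[i] = np.clip(X[i], lower, upper)  # keep the whale inside the bounds
            f = float(fitness(X[i]))
            if f < best_f:
                best_x, best_f = X[i].copy(), f
    return best_x, best_f
```

A call such as `woa_minimize(lambda x: float(np.sum(x ** 2)), [-5, -5, -5], [5, 5, 5])` returns the best position found and its fitness value.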
The proposed WOA-RVFL method. The proposed method for classification of the VDPV outbreak incidence is based on the integration of the RVFL and WOA algorithms. In WOA-RVFL, WOA is used to find the best configuration of the parameters for the RVFL network. The proposed WOA-RVFL approach consists of two stages: (1) the learning stage and (2) the evaluation stage. In the learning stage, WOA-RVFL starts with splitting the dataset into training, validation and testing sets, and then generating a random population X with N solutions. Each solution represents one configuration for the RVFL network. Thereafter, RVFL is constructed based on the parameters inside the current solution. The RVFL network is trained using the training set and then validated using the validation set. After evaluating all solutions within population X, the best solution is determined. The population X is then updated by the operators of WOA. These steps are repeated until the termination criteria are met. Meanwhile, the second stage starts with constructing the RVFL network using the best configuration, and then evaluating the network using the testing data.

Learning stage. In this stage, the dataset is divided into three sets: training, validation and testing. The training and validation sets are used during this stage. The next step is to generate a population X that contains $N_{cf}$ solutions, each of dimension $N_{par}$, as:

$$x_{ij} = l_j + \mathrm{rand} \times (u_j - l_j), \quad i = 1, \dots, N_{cf}, \; j = 1, \dots, N_{par} \tag{11}$$

where rand is a random number in [0, 1], and $u_j$ and $l_j$ represent the upper and lower boundary of the $j$th parameter, respectively. In order to explain this process, consider that the current solution is $x_i = [N_h, \mathrm{Bias}, \mathrm{link}, \mathrm{AF}, \mathrm{RT}, \mathrm{mode}, \mathrm{Scale}_m]$, where
$N_h$ is the number of hidden neurons; Bias is the parameter that determines whether there is a bias in the output neurons; link indicates whether the network has a direct link to the output layer; AF is the activation function (hardlim, sign, sig, radbas, sin, or tribas); RT represents the type of randomization method used to generate the weights (uniform or Gaussian); mode represents the method used to update the weights (regularized least squares or Moore-Penrose pseudoinverse); and Scale$_m$ is a parameter representing how the features are scaled (i.e., scaling the features for (1) all neurons, (2) each hidden neuron separately, or (3) the range of the randomization for the uniform distribution). The next step is to construct the RVFL network using the current solution $x_i$, train it on the training set, and evaluate the trained network on the validation set by computing the error between the predicted and original values of the target using the following equation:

$$\mathrm{Fit} = 1 - \theta \tag{12}$$

where $\theta$ represents the accuracy of the current RVFL network. Thereafter, the best solution is selected and the current population X is updated using the steps of the WOA as discussed in Algorithm 1. The process of updating the solutions of X is repeated until the termination criteria are met.

Evaluation stage. This stage starts with selecting the best configuration of RVFL and evaluating its accuracy on the testing data using different performance measures. The WOA-RVFL classification process is illustrated in Fig. 2.

Experimental study. The experimental study is conducted in two phases. The WOA-RVFL algorithm is first benchmarked using 11 UCI machine learning datasets [40]. Thereafter, the method is implemented for the prediction of the VDPV outbreak incidences.
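The solution encoding and the fitness of the learning stage can be made concrete with a small sketch. The text lists the seven configuration parameters but not how a continuous WOA position is mapped onto them, so everything below is a hypothetical encoding: each solution is assumed to lie in [0, 1]^7, binary parameters are thresholded at 0.5, categorical parameters are picked by equal-width bins, and `max_hidden` and `accuracy_fn` are illustrative names.

```python
ACTIVATIONS = ["hardlim", "sign", "sig", "radbas", "sin", "tribas"]
RANDOMIZATIONS = ["uniform", "gaussian"]
MODES = ["ridge", "pinv"]
SCALE_MODES = [1, 2, 3]


def pick(options, u):
    """Map u in [0, 1] onto one of the options using equal-width bins."""
    idx = min(int(u * len(options)), len(options) - 1)
    return options[idx]


def decode(solution, max_hidden=200):
    """Turn a 7-dimensional position in [0, 1]^7 into an RVFL configuration."""
    n_h, bias, link, af, rt, mode, scale = solution
    return {
        "n_hidden": 1 + int(round(n_h * (max_hidden - 1))),
        "bias": bool(bias >= 0.5),
        "link": bool(link >= 0.5),
        "activation": pick(ACTIVATIONS, af),
        "randomization": pick(RANDOMIZATIONS, rt),
        "mode": pick(MODES, mode),
        "scale_mode": pick(SCALE_MODES, scale),
    }


def fitness(solution, accuracy_fn):
    """Fit = 1 - theta, where theta is the validation accuracy of the decoded network."""
    return 1.0 - accuracy_fn(decode(solution))
```

Here `accuracy_fn` stands for any routine that trains an RVFL network with the decoded configuration on the training set and returns its accuracy on the validation set; minimizing `fitness` with WOA then corresponds to maximizing validation accuracy.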
In order to analyze the performance of the WOA-RVFL method, a set of performance measures is used, including the Accuracy, Precision, and Recall, defined as:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{13}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{14}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{15}$$

where TP, TN, FP, and FN denote the true positive, true negative, false positive, and false negative samples, respectively.
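As a quick check of the three measures, the following sketch computes them from the four confusion counts. The function name and the convention of returning 0 when a denominator is empty are assumptions; for multi-class problems the counts would be taken per class, and how they are averaged is not stated in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (assumed convention: report 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```

For example, with TP = 40, TN = 45, FP = 5 and FN = 10, the accuracy is 85/100 = 0.85, the precision is 40/45 (about 0.889), and the recall is 40/50 = 0.80.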
Phase I: UCI Datasets. The performance of the proposed method is evaluated using the widely used set of UCI datasets given in Table 1. The datasets have different characteristics, which make their classification a challenging problem. For each case, the available data are randomly divided into training (80%), validation (10%) and testing (10%) subsets.
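The random 80/10/10 partition can be sketched as follows. The function name, the fixed seed and the rounding rule are assumptions made for reproducibility of the illustration; the text does not specify how the random split was drawn.

```python
import numpy as np


def split_indices(n_samples, train=0.8, val=0.1, seed=0):
    """Randomly split sample indices into training, validation and testing parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train * n_samples))
    n_val = int(round(val * n_samples))
    # Whatever remains after training and validation goes to testing.
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])
```

The three returned index arrays are disjoint and together cover every sample exactly once, so they can be used directly to slice the feature matrix and the label vector.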

Results and discussion
The results of a comparative study between the proposed WOA-RVFL method and the traditional RVFL algorithm are shown in Table 2 and Figs. 3-5. The parameter settings that provide the best predictions are as follows: For the WOA algorithm, parameter a was set to 2, and b = 1. Also, the optimal population size and the total number of iterations were both 20. The parameters of the traditional RVFL algorithm were set based on some recommended values 23 and after a trial and error approach. Accordingly, radbas was taken as the activation function (AF), with a Bias and a link between the input and output (i.e. Bias = 1 and link = 1). The ridge regression was used to update the weights (i.e., mode = Ridge Regression). The optimal number of hidden neurons was 200, and Scale m = 1. Both algorithms were implemented in Matlab 2017b in a Windows 10 64-bit environment using a PC with 4 GB RAM and an Intel ® Core ™ i3-3110M Processor. On average, the CPU times for the training of the WOA-RVFL and RVFL algorithms were, respectively, 0.3936 s and 0.3833 s. As seen in Table 2, the performance of the proposed WOA-RVFL is notably better than RVFL in nearly all cases. The Precision, Accuracy and Recall rates of the proposed WOA-RVFL method are higher than those of RVFL on the training, validation and testing data. This indicates that introducing the WOA into the RVFL algorithm has improved both its learning and generalization capabilities. This superior performance is more noticeable for six datasets (Zoo, Wine, PCMAC1, Hayseroth, HouseVote, Madelon). Moreover, Figs. 3-5 show the high performance of the proposed WOA-RVFL against the traditional RVFL in terms of Precision, Accuracy and Recall. By analyzing the behavior of the WOA-RVFL during the training phase, it can be observed that the differences between the accuracy, recall, and precision of the WOA-RVFL and the traditional RVFL are nearly 3%, 4%, and 2.5%, respectively.
During the validation phase, the differences between them in terms of accuracy, recall, and precision are 6%, 7%, and 5%, respectively. On the testing set, the differences between the proposed WOA-RVFL and the traditional RVFL are close to those of the validation phase: nearly 6%, 5%, and 4% for accuracy, recall, and precision, respectively.
Moreover, the Friedman (FD) test is used to determine if there is a significant difference between the WOA-RVFL and traditional RVFL. The results of the FD test are given in Table 3. It can be noticed that the proposed method has a better mean rank than the traditional RVFL in terms of precision, recall, and accuracy across all the tested datasets and dataset partitions (i.e., the row named average). In addition, there is a significant difference between the WOA-RVFL and RVFL. However, when the results are compared separately over the training, validation, and testing sets, there is no significant difference, although the proposed WOA-RVFL has the best mean rank over all these sets.

The cohorts included in this study are given as follows:
Fully vaccinated children (Cohort 1):
• Included 806 children, aged less than 3.5 years.
• All children completed Health Center records of three doses of pentavalent vaccine and of OPV.
• Antibodies against tetanus were used as a proxy for the vaccination session attendance.
Children with unknown vaccination status (Cohort 3):
• Included 320 children aged less than 9 years
• In 2012, samples were measured from Bolikhamxay, Vientiane and Luang Prabang provinces.
Healthcare workers (Cohort 5):
• Included 700 people aged between 15 and 69 years in 2013
• Samples were collected in 3 central, 2 provincial and 8 district hospitals located in Vientiane capital, Huaphan and Bolikhamxay provinces respectively.
Similar to the simulations for the UCI datasets, the available data were randomly divided into training, validation and testing subsets. Out of 2448 samples, 1958, 244 and 244 were taken for the training, validation and testing of the WOA-RVFL and RVFL models. Each model is executed for 25 independent runs. Table 4 shows the descriptive statistics of the two major input parameters included in the model development, namely Age and Titers. The other input parameters considered are the Cohort type, which has five groups, Sex, which is either male or female, and the Province, which includes nine places. The output parameter is Polio Immunoglobulin G (IgG), which includes three groups, namely positive, equivocal, and negative. Fig. 6 depicts the correlation among the five input parameters and between them and Polio IgG. As seen, the Sex parameter has the smallest correlation with the other parameters. Additionally, the Cohort type, Titers, Age, and Province are correlated with Polio IgG with values greater than 0.20.

Results and discussion
A comparison of the predictions made by the WOA-RVFL and classical RVFL methods is given in Table 5. On average, the CPU times for training the WOA-RVFL and RVFL algorithms were, respectively, 3.67 s and 8.62 s for the VDPV outbreak database. As seen in Table 5, the WOA-RVFL model significantly outperforms the RVFL model in terms of Accuracy, Precision and Recall rates. This holds for the training, validation and testing data.
Moreover, the obtained results are in line with what was detected during the outbreak, where participants born before vaccination were significantly less likely to be seropositive. These results agree with the outbreak epidemiology. Neutralizing antibodies against all poliovirus serotypes were detected in all children. Likewise, neutralizing antibodies against all serotypes were detected in all healthcare workers. In addition, the WOA-RVFL method identified the IgG status in the class of fully vaccinated children aged less than 3.5 years. The antibody seroprevalence of unvaccinated children from marginalized areas was found to be lower than that of vaccinated children. On the other hand, healthcare workers are classified as having a lower antibody seroprevalence than blood donors. Notably, the proposed model categorizes both the children aged less than 1 year and younger adults as having more antibodies than older age groups, supporting the idea that antibody levels were negatively associated with age.
However, VDPV outbreaks are becoming an ever more interdisciplinary problem. In this context, scientists need to address how ML approaches can analyze the enormous amounts of data pouring in from epidemiology and immunology to sustain the clinical diagnostic tools [41]. The proposed WOA-RVFL approach presents an efficient methodological contribution to both ML and mathematical programming together with relevant insights into immunization evaluation. The WOA-RVFL analyzed the disparity between the different immunology assays. It is worth mentioning that high-risk countries may benefit from the proposed WOA-RVFL method for evaluating different immunization programs. This can be particularly important for the cases that involve uncertain vaccination coverage or emergence virus neutralization tests (VNT).
In the polio-free areas considering seropositivity by ELISA, the proposed WOA-RVFL method can discriminate the trivalent vaccination from vulnerability to VDPV. Nonetheless, the improved ELISA must be serotype distinct, and negativity thresholds should be studied for specificity and sensitivity. It should be noted that out of the five examined cohorts, both healthcare workers (cohort 5) and children (cohort 1) were analyzed by VNT 35 . The WOA-RVFL method handled the healthcare workers as a practical example of adults at high risk during the outbreak, since they are at a higher risk of exposure to infections, with a possibility of transferring the infection from one cohort to another. Thus, including healthcare workers in the proposed model helps understand the epidemiology of the outbreak and prevent the spread of disease from healthcare workers to patients, many of whom may be highly susceptible to infections and related complications. Therefore, it is important to track the immunization and vaccination of professionals and to ensure their ability to provide critical care for patients.
The WOA-RVFL algorithm can observe the ELISA serologies of the other children and adult cohorts matched with the results of the groups that were tested by VNT 35 . The WOA-RVFL suggests a high efficiency of outreach vaccination activities, since the children from the remote area were equally well protected as the fully vaccinated children. The lower seropositivity rates were classified and predicted in fully vaccinated children and children of unknown status. This is compatible with the first clinical VDPV outbreak cases that occurred in the same area 36 . Given these features, the WOA-RVFL supports the idea of repairing the deficiencies associated with vaccine management that directly affect vaccination efficacy.

Phase III: Comparison with other meta-heuristic methods. In this section, the performance of the proposed WOA-RVFL is compared with other meta-heuristic techniques used to determine the optimal parameters of RVFL. These methods include particle swarm optimization (PSO), artificial bee colony (ABC), and sine-cosine algorithm (SCA). The parameter settings for each algorithm follow the original papers, and the common parameters, such as the number of solutions and the total number of iterations, are set as in the first experiment. In addition, in this study the dataset is divided into training and testing sets using 10-fold cross-validation. This means that the dataset is split into 10 sets; one of them is used for testing and the other nine are used for training, and this process is repeated 10 times until every set has been used as the testing set. Table 6 depicts the comparison results between the four algorithms using different measures. From this table it can be observed that the compared algorithms perform equally on the training set. Meanwhile, on the testing sets, the accuracy of the WOA-RVFL is better than that of the other methods, followed by the SCA-RVFL, which ranks second with nearly 97%; the performance of the ALO-RVFL is better than that of the MFO-RVFL. The same observation holds in terms of precision and recall.

Table 5. Prediction performance of the WOA-RVFL and RVFL methods for classifying the VDPV outbreak incidences.
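The 10-fold cross-validation protocol used in Phase III can be sketched with plain NumPy. The generator name, the shuffling step and the seed are assumptions for illustration; the text only states that the data are split into 10 sets and that each set serves once as the testing set.

```python
import numpy as np


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs so that each fold is tested exactly once."""
    rng = np.random.default_rng(seed)
    # Shuffle once, then cut into k nearly equal folds.
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx
```

Each optimizer would then be run on the training indices of a fold and scored on its testing indices, with the reported measures averaged over the 10 folds.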
Conclusions
This study presents a hybrid ML approach to predict the VDPV outbreak incidences. The proposed method, called WOA-RVFL, integrates the RVFL networks with the robust WOA optimization algorithm. It was shown that WOA notably improves the prediction accuracy of the RVFL network by finding a suitable parameter configuration for this algorithm. The classification performance of the proposed WOA-RVFL method was first verified using a number of datasets from the UCI ML repository. The WOA-RVFL algorithm was then deployed to track the VDPV outbreak incidences and Polio IgG that recently occurred in several provinces of the Lao People’s Democratic Republic. Based on the results, the WOA-RVFL algorithm is efficient in detecting the VDPV outbreak incidences and outperforms the traditional RVFL method. Future research can focus on implementing the WOA-RVFL algorithm to improve quantitative structure-activity relationship (QSAR) models and on applying it to other public health surveillance applications.