National immunization programs are among the most cost-effective health interventions, with tested methods for reaching the most vulnerable and difficult-to-reach groups in both developing and developed countries1,2,3,4. The World Health Organization (WHO) launched the Expanded Program on Immunization (EPI) in 1974 to ensure universal access to all vaccines for all targeted groups, including children, adolescents, and adults5. According to the WHO guidelines, the national EPI now aims to immunize infants between the ages of 0 and 23 months (about 2 years) against vaccine-preventable childhood illnesses with eight doses: one dose of measles vaccine, three doses of polio vaccine, one dose of Bacillus Calmette-Guerin (BCG), and three doses of pentavalent vaccine6. Hence, a child is fully vaccinated if he/she has received all eight doses listed above.

According to a 2019 report by the United Nations Children’s Fund (UNICEF) and WHO, almost 20 million children were either unvaccinated or incompletely immunized, making them more susceptible to mortality and morbidity7. Of the 20 million children worldwide who missed vaccination in 2019, over 60% were from 10 countries, and many of them live in countries with weak health systems7,8. The global under-five mortality rate declined by 59%, from 93 deaths per 1000 live births in 1990 to 38 in 2021, in large part due to immunization9. The COVID-19 pandemic and related disruptions have put a burden on health systems, resulting in 25 million children missing out on vaccinations in 2021, a number that is 5.9 million higher than in 2019 and the largest since 200910.

Although WHO's goal is to make vaccination services accessible to everyone worldwide by 2030, about 13.5 million children did not receive the first dose of a vaccine due to lack of access to vaccination services11. In Ethiopia, over 10.9 million children under the age of one missed the first dose of measles vaccine between 2010 and 2018, the highest number reported12. The COVID-19 experience demonstrates the vital role that effective vaccination programs play in preventing illness, saving lives, and promoting a wealthier future13.

Even though Africa has made remarkable progress in immunization services, the 2013 immunization data report put vaccine coverage at 75 percent, and Ethiopia had the second largest number of unvaccinated children in the region, next to Nigeria14. More children in Africa have missed their immunizations in recent years as the number of births has increased and immunization programs have stagnated12. In Ethiopia, the prevalence of complete childhood vaccination among children aged 12–23 months increased from 24.6% in 2011 to 39% in 201611. Despite this, according to a recent systematic review and meta-analysis, one in two children in Ethiopia was not vaccinated, or four out of ten children were incompletely vaccinated15.

Several studies have investigated the potential factors associated with incomplete immunization using classical statistical techniques16,17,18,19,20, which rest on prior assumptions that could limit the potential to discover hidden knowledge. In contrast, machine learning algorithms are designed to make the most accurate predictions possible, enabling systems to learn from data rather than relying on prior assumptions21. Rates of incomplete childhood immunization remain high, which calls for further investigation to prioritize and promote childhood vaccination and to ensure the health and well-being of all children in East Africa. Therefore, this research aimed to predict incomplete immunization among under-five children in East Africa using machine learning algorithms.


Study design and setting

The Demographic and Health Survey (DHS) uses a population-based cross-sectional design to collect data, and this study employed a predictive modeling approach. Secondary DHS data from six East African countries (Burundi, Ethiopia, Madagascar, Uganda, Rwanda, and Zambia), collected between 2016 and 2021, were considered for this analysis.

Source and study population

The source population included all mothers aged 15–49 years who had children under the age of five, while all mothers aged 15–49 years who had children under the age of five and had started immunization for their children were considered the study population.

Inclusion criteria

Mothers with children aged 12–35 months who had begun immunization were included in the study.

Data source, Sample size and sampling procedure

Data source

Data were obtained from the MEASURE DHS program22. The DHS is a nationally representative survey that collects data on basic health indicators such as mortality, morbidity, family planning service utilization, fertility, and maternal and child health services (including vaccination). Each country’s survey consisted of different datasets, including men, women, children, birth, and household datasets.

Sample size determination and sampling procedure

A total weighted sample of 27,806 and an actual (unweighted) sample of 27,691 were considered from six East African countries (Burundi, Ethiopia, Madagascar, Uganda, Rwanda, and Zambia), as shown in Table 1.

Table 1 Sample size determination for incomplete immunization in east Africa DHS 2016–2021.

The DHS used a two-stage stratified sampling technique to select study participants. In the first stage, Enumeration Areas (EAs) were randomly selected, whereas in the second stage households were selected. The survey datasets were accessed through the web page of the International DHS Program after registration and approval of the data request.

Study variables

Incomplete immunization in children under the age of five was the outcome variable, categorized as 1 = Yes (children who had not completed the full dose of vaccination) and 0 = No (those who had received the full dose of vaccination). Baseline explanatory variables were selected from previous studies14,16,18,23,24,25,26,27,28,29,30. The independent variables were grouped as follows. Sociodemographic factors included mother's age, marital status, mother's occupation, mother's educational level, husband's education, place of residence, and sex of household head. Socioeconomic factors included wealth index and media exposure. Reproductive (obstetric) history factors included mother's history of ANC follow-up, place of delivery, sex of child, number of living children, birth order, child size at birth, PNC visit, and preceding birth interval.

Operational definition

Incomplete immunization: “children who started vaccination and missed at least one dose from eight recommended vaccination at any time instance between 1 and 12 months”31,32.

Complete immunization: “when children had been vaccinated for all recommended vaccination (one dose of BCG, three doses of polio, three doses of pentavalent, and one dose of measles)”32.
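The two operational definitions reduce to a simple rule on dose counts. The following is a minimal sketch of that rule, assuming hypothetical field names (`bcg`, `polio`, `pentavalent`, `measles`) rather than the actual DHS variable codes, which are not given in the text:

```python
# Required doses per vaccine, taken from the operational definition above.
# The dictionary keys are hypothetical names, not DHS variable codes.
REQUIRED_DOSES = {"bcg": 1, "polio": 3, "pentavalent": 3, "measles": 1}


def is_incomplete(doses_received: dict) -> int:
    """Return 1 if a child who started vaccination missed at least one of
    the eight recommended doses, else 0 (the outcome coding used here)."""
    for vaccine, required in REQUIRED_DOSES.items():
        if doses_received.get(vaccine, 0) < required:
            return 1
    return 0


# Hypothetical records, for illustration only.
full = {"bcg": 1, "polio": 3, "pentavalent": 3, "measles": 1}
partial = {"bcg": 1, "polio": 2, "pentavalent": 3, "measles": 0}
print(is_incomplete(full), is_incomplete(partial))
```

Coding the outcome as 1 for the incomplete group matches the study variable definition, so the positive class in every later metric refers to incomplete immunization.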

Data management and analysis

Data extraction was carried out using Stata version 17, and the data were then imported into Jupyter Notebook for further analysis. Sample weighting was used to draw valid inferences. Data were thoroughly cleaned, and missing values were imputed to ensure completeness. Outlier detection was performed to identify and handle extreme values that could have skewed the analysis. Python 3 in Jupyter Notebook, with the imblearn, sklearn33, XGBoost34 and SHAP35 packages, was used to perform the necessary calculations and analysis.

Machine learning framework for prediction of incomplete immunization

A general framework used in earlier research36, based on Yufeng Guo's seven steps of machine learning37, was adopted (Fig. 1) to predict incomplete vaccination. All machine learning algorithms and techniques were implemented using Python version 3.10.11 in Jupyter Notebook.

Figure 1

Overview of the machine learning framework applied for prediction of incomplete immunization. LR logistic regression, RF random forest, KNN k-nearest neighbor, ANN artificial neural network, SVM support vector machine, NB Naive Bayes, XGB eXtreme gradient boosting, DT decision tree.

Data collection and preprocessing method

The dataset for this study was extracted from the Demographic and Health Survey website and obtained upon a formal request after registration on their system. The total actual sample comprised 27,691 under-five children who had started vaccination. Appropriate data preparation was then performed to make the data suitable for the ML task.

Missing data were managed using imputation procedures that fill incomplete fields with statistically relevant substitutes. The k-nearest neighbor (KNN) technique has proven to be generally effective for missing value imputation38. In this study, the simple imputer class of the scikit-learn module (using the mode) was used for categorical data, and KNN imputation was used for numerical data. Outliers were identified using boxplots and replaced using Interquartile Range (IQR) scores before the next step.
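To make the mode imputation and IQR steps concrete, here is a dependency-free sketch. It does not reproduce the scikit-learn imputers used in the study, and the KNN imputation of numerical data is omitted. Clipping outliers to the boxplot fences (1.5 × IQR) is an assumption about how values were "replaced"; the text does not specify the rule. All data values are hypothetical.

```python
from collections import Counter
from statistics import quantiles


def impute_mode(values):
    """Replace None entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def cap_outliers_iqr(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Hypothetical categorical column with one missing entry.
places = ["health facility", None, "home", "health facility"]
print(impute_mode(places))

# Hypothetical preceding birth intervals in months; 240 is an extreme value.
intervals = [12, 18, 24, 24, 30, 36, 36, 40, 48, 240]
print(cap_outliers_iqr(intervals))
```

Clipping keeps every record in the sample, which matters when the survey weights are tied to individual rows; dropping rows would be an alternative with different trade-offs.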

Before fitting the ML models, feature engineering was applied. Among various data transformation techniques, we used one-hot encoding and label encoding to convert categorical variables into numeric values, and min–max normalization for scaling. Standard balancing strategies, including random under-sampling, random over-sampling, and the Synthetic Minority Oversampling Technique (SMOTE), were tested to address the imbalanced categories of the outcome variable. SMOTE outperformed the other resampling techniques on the baseline model.
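The scaling and oversampling steps can be illustrated with a small, dependency-free sketch. This is not the imblearn implementation used in the study: `smote_like` is a hypothetical helper that only shows the core idea of SMOTE, namely creating a synthetic minority point on the line segment between a minority sample and one of its nearest minority neighbours. The data points are invented.

```python
import random


def min_max_scale(column):
    """Rescale a numeric column to [0, 1] (min-max normalization)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    sampled point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority points by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


print(min_max_scale([15, 32, 49]))

# Hypothetical minority-class points in two already-scaled features.
minority = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.15, 0.25)]
new_points = smote_like(minority, n_new=3)
print(new_points)
```

Because synthetic points are interpolations, they stay inside the range spanned by the observed minority samples. Resampling is normally applied to the training portion only, so that the test set keeps real observations.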

Following feature engineering, dimensionality reduction was applied. High-dimensional data may contain a lot of redundant and uninformative features, which can seriously reduce how well learning algorithms work39. Mutual information and variance threshold (filter methods), Recursive Feature Elimination (RFE, a wrapper method), and the Boruta feature selection method were tested, and their performance on the baseline model was compared.

Since every ML model needs training and test data, the data were split into 80% for training and 20% for testing. The widely used k-fold cross-validation approach was also applied to assess model performance, because a single train-test split can give over-fitted or under-fitted results depending on how the data happen to be split. In the k-fold method, the dataset is divided into k sub-samples; in each round, one sub-sample is used for testing and the remaining k − 1 sub-samples are used for training33.
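The 80/20 hold-out split and the k-fold scheme described above can be sketched without any library. The function names and the sample size of 100 are illustrative assumptions; the study itself relied on scikit-learn utilities.

```python
import random


def train_test_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and hold out `test_fraction` for testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(indices, k):
    """Yield (train, validation) index lists; each fold validates once."""
    folds = [indices[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val


train, test = train_test_indices(100)
splits = list(k_fold_indices(train, k=10))
print(len(train), len(test), len(splits))
```

Every training index lands in exactly one validation fold, so each observation contributes to the validation score once, which is what makes the averaged score less dependent on a single lucky or unlucky split.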

Model development methods

The dataset used in the analysis falls under the category of binary classification since incomplete immunization is categorized into two mutually exclusive categories. Accordingly, eight classification algorithms (Logistic Regression, Random Forest, K-nearest neighbor (KNN), Artificial Neural Network, Support Vector Machine, Naïve Bayes, eXtreme gradient boosting (XGBoost), and Decision tree) were fitted for this study. These methods were chosen based on prior research that used machine learning techniques for classification tasks using DHS data, with each country's performance taken into account40,41,42,43,44,45,46,47.

To verify the algorithms’ classification performance, a confusion matrix (also known as an error matrix) and the Jaccard score were used. The confusion matrix summarizes the actual and predicted classifications of a dataset and shows the number of correct and incorrect predictions, which are further categorized into true negatives, false negatives, true positives, and false positives. Additionally, the importance and effect of each variable's contribution to the outcome were identified using SHapley Additive exPlanations (SHAP). SHAP is a game-theoretic approach to explaining the output of any machine learning model; it connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions35. Furthermore, the area under the receiver operating characteristic curve (AUC) was used to visualize and summarize the performance of the ML models. Details of the confusion matrix were adapted from48 and are presented as follows:


| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |

Building an efficient ML model depends heavily on tuning hyperparameters, especially for tree-based ML models, which have many of them49. It can be challenging to determine which hyperparameter values to use for a particular algorithm on a specific dataset, so candidate values are searched until predefined criteria are satisfied. For this study, hyperparameter tuning was done using the grid search method. Finally, the highest-performing supervised classifier was used to predict incomplete immunization based on the identified independent factors.

Prediction alone does not reveal the precise feature categories connected to incomplete immunization. For this purpose, rule generation was done using the best-performing model. Association rules are IF–THEN rules that are particularly useful because they are simple to understand and restrict the attributes used during rule generation to those that are pertinent50.

Ethics considerations and consent to participate

Since the study was a secondary data analysis, participant consent was not necessary. Permission for data access was granted by the Demographic and Health Survey (DHS) program through its online platform after fulfilling all requirements needed to access the data. The IRB-approved procedures for DHS public-use datasets do not allow respondents, households, or sample communities to be identified. There are no names of individuals or household addresses in the data files.


Sociodemographic characteristics

This study included a total weighted sample of 27,806 children under five. About 79.54% of participating mothers resided in rural areas. Males accounted for 79.22% of household heads, and about 20% of household heads were female. Approximately 85% of mothers were married, while 6.23% were single and 9% were widowed or divorced. Across all included countries, only 3.62% of mothers were professional workers, and most (71.78%) were not professional workers. More than half (55.35%) of husbands had primary education, 21.37% had no formal education, and only 23.28% had completed secondary education or above. About half of mothers (49.32%) had primary education, while 27.25% had no formal education and 21.37% had completed secondary education or above. A summary of sociodemographic characteristics is shown in Table 2, and mothers' ages are shown in Fig. 2.

Table 2 Sociodemographic characteristics of incomplete immunization among under-five children in east Africa DHS 2016–2021.
Figure 2

Line plot graph of mothers' age vs immunization status in east Africa DHS 2016–2021.

Reproductive (obstetrics) history characteristics

Slightly more than half (51.02%) of the children born to the participants were male. Most participants did not receive a PNC checkup, which is concerning because it is an important aspect of postnatal care; only 21.38% of the participants received a PNC checkup. Most deliveries took place in a health institution, a positive indicator of access to healthcare services, although home deliveries still accounted for a considerable proportion of births (32.86%). Finally, most children (68.93%) were of average size at birth, while 21.14% were small and 9.93% were large. A summary of reproductive history characteristics is shown in Table 3.

Table 3 Reproductive (obstetrics) history characteristics of incomplete immunization in east Africa DHS 2016–2021.

Socioeconomic characteristics

According to the data, the wealth index of the participants was distributed as follows: 48.35% were poor, 34.65% were middle class, and 16.99% were rich. It is important to note that many participants were classified as poor, which could have implications for their access to healthcare and other essential services. Regarding media exposure, most participants had access to media, accounting for 51.62% of the total number. However, it is concerning that almost half (48.38%) of the participants did not have access to media, which could limit their access to important health information and education.

Machine learning analysis of incomplete immunization

Feature selection was attempted using different techniques to reduce the number of features, as shown in Fig. 3. Mutual information and variance threshold (filter methods), Recursive Feature Elimination (RFE, a wrapper method), and the Boruta feature selection method were tested, and their accuracy on the baseline model was compared. Despite testing these methods, the highest accuracy was achieved when all features were included in model development. This may be because the original features were already informative, so that including all of them resulted in the best performance.

Figure 3

Feature selection methods for incomplete immunization.

Model development and evaluation

Data were split into training and test sets after being cleaned and balanced. We allocated 80% of the data for training and 20% for testing. We then developed eight ML models to predict incomplete immunization. All models were fitted on both unbalanced and balanced data. Finally, each model’s performance on the test set was evaluated and compared before and after balancing in order to select the best predictive model. Higher performance was achieved after balancing the target variable, as shown in Table 4.

Table 4 Model performance comparison.

After applying the SMOTE balancing technique, the random forest and XGBoost models were the best predictive models, with comparable performance: accuracy of 78.34%, f1-score of 76.76%, and Jaccard score of 62.29% for random forest, and accuracy of 78.78%, f1-score of 76.24%, and Jaccard score of 61.16% for XGBoost.

The models that performed best on balanced data, the random forest and XGBoost classifiers, were then subjected to hyperparameter tuning. Since both models had roughly the same performance in this study, tuning was applied to both in order to identify the best model, using grid search with ten-fold cross-validation. Since it is not straightforward to select the best parameters in advance, the following grid was searched for the random forest model: 'criterion': ['entropy'], 'n_estimators': [100, 200, 500], 'max_depth': [None, 5, 10], 'max_features': ['sqrt', 'log2', None]. The selected values were 'max_depth': None, 'max_features': 'log2', 'n_estimators': 500, with 'random_state': 0. For the XGBoost model, 'n_estimators': [100, 200, 500], 'max_depth': [3, 5, 10], 'learning_rate': [0.1, 0.01, 0.001], 'subsample': [0.8, 1.0], 'colsample_bytree': [0.8, 1.0], 'random_state': [0, 42] were searched, and 'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 500, 'random_state': 42, 'subsample': 0.8 were selected. After tuning, XGBoost outperformed random forest; therefore, it was employed as the prediction model.
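To give a sense of the size of this search, the sketch below enumerates the two grids as listed in the text and counts the model fits implied by ten-fold cross-validation. It only expands the Cartesian product with the standard library; no model is trained, and `expand` is an illustrative helper, not part of scikit-learn.

```python
from itertools import product

# Grids transcribed from the text above.
rf_grid = {
    "criterion": ["entropy"],
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", "log2", None],
}
xgb_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 5, 10],
    "learning_rate": [0.1, 0.01, 0.001],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "random_state": [0, 42],
}


def expand(grid):
    """List every hyperparameter combination in a grid (Cartesian product)."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


rf_candidates = expand(rf_grid)
xgb_candidates = expand(xgb_grid)
n_folds = 10  # ten-fold cross-validation, as stated in the text

print(len(rf_candidates), len(rf_candidates) * n_folds)
print(len(xgb_candidates), len(xgb_candidates) * n_folds)
```

Grid search is exhaustive, so the cost grows multiplicatively with every added hyperparameter. Note that including `random_state` in the XGBoost grid doubles the number of fits while tuning the seed rather than the model.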

Visualization of feature importance

While classical analysis is more structured and relies on pre-defined rules and formulas, such as p-value cut-offs, to select significant features, machine learning algorithms are designed to adapt and learn from data. However, ML models are often considered black boxes because it is difficult to interpret why an algorithm provides accurate predictions on a particular problem51; therefore, we used SHAP values in this study. SHAP is a unified framework proposed by Lundberg and Lee52 to interpret ML predictions and to explain various black-box ML models. We leveraged SHAP to explain our predictive model and the predictors that lead to incomplete immunization. The importance of predictors was evaluated by the mean SHAP value, as shown in Fig. 4. Features with a long bar located at the top are highly related to incomplete immunization. The feature importance results showed that the number of living children, ANC follow-up history, maternal age, place of delivery, birth order, and preceding birth interval were associated with a higher predicted probability of incomplete immunization among under-five children.

Figure 4

Mean SHAP value of feature importance of incomplete immunization.
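The quantity plotted in a mean SHAP bar chart can be illustrated on a toy example. The sketch below computes exact Shapley values by brute-force enumeration of feature orderings and then averages their absolute values across rows. It is only an illustration: `toy_model`, its coefficients, and the three binary features are hypothetical, and the enumeration is feasible only for a handful of features, whereas the SHAP package uses efficient algorithms for tree models.

```python
from itertools import permutations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to `baseline`,
    averaging each feature's marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]  # switch feature i from baseline to x
            value = model(current)
            phi[i] += value - prev
            prev = value
    return [p / factorial(n) for p in phi]


def toy_model(z):
    """Hypothetical risk score, not the study's XGBoost model."""
    home_delivery, no_anc, young_mother = z
    return (0.2 + 0.3 * home_delivery + 0.2 * no_anc
            + 0.1 * home_delivery * young_mother)


rows = [(1, 1, 1), (1, 0, 0), (0, 1, 1)]
baseline = (0, 0, 0)
per_row = [shapley_values(toy_model, x, baseline) for x in rows]

# Global importance: mean absolute Shapley value per feature.
mean_abs = [sum(abs(r[i]) for r in per_row) / len(rows) for i in range(3)]
print(mean_abs)
```

For each row, the Shapley values sum to the difference between the model output at that row and at the baseline, which is the additivity property that lets per-feature contributions be read as shares of a prediction.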

In Fig. 5, features are ranked on the y-axis by their importance to the predictive model. The SHAP value (x-axis) is a unified index of the influence of a given feature on the model output. In each feature row, the attributions for all observations are drawn as colored dots, where red dots represent high-risk values and blue dots represent low-risk values: features pushing the prediction higher are shown in red, and those pushing the prediction lower are shown in blue. The remaining variables had only slight to low effects on incomplete immunization.

Figure 5

Impact of each variable on prediction of incomplete immunization.

Predicting incomplete immunization

After training, 5698 test samples were used to evaluate the XGBoost model's performance. Out of 2856 children with incomplete immunization status, the model correctly predicted 2567 as incomplete (true positives). Out of 2842 children with complete status, the model correctly predicted 1935 as complete (true negatives). However, the model misclassified 907 truly complete cases as incomplete (false positives) and 289 truly incomplete cases as complete (false negatives), as shown in Fig. 6. Overall, the model achieved an accuracy of 79.01%, recall of 89.88%, F1-score of 81.10%, and precision of 73.89% on test data.

$$\mathbf{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}} = \frac{2567 + 1935}{2567 + 907 + 289 + 1935} = \mathbf{79.01}\%$$
$$\mathbf{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{2567}{2567 + 907} = \mathbf{73.89}\%$$
$$\mathbf{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{2567}{2567 + 289} = \mathbf{89.88}\%$$
$$\mathbf{F1\ score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times 0.7389 \times 0.8988}{0.7389 + 0.8988} = \mathbf{81.10}\%$$
Figure 6

Confusion matrix of the XGBoost model's predictions on test data.
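The reported test-set metrics can be re-derived from the four confusion-matrix counts. The sketch below is a plain-Python check rather than the scikit-learn calls used in the study; it also computes the Jaccard score for the positive class, using the standard definition TP / (TP + FP + FN).

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive summary metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    jaccard = tp / (tp + fp + fn)  # overlap of predicted and actual positives
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "jaccard": jaccard,
    }


# Counts reported for the tuned XGBoost model on the test data.
metrics = classification_metrics(tp=2567, fp=907, fn=289, tn=1935)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Recall is much higher than precision here: the model misses few incompletely immunized children (289 false negatives) at the cost of flagging 907 fully immunized children, a trade-off that may be acceptable when the goal is outreach to children at risk.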

The area under the receiver operating characteristic curve (AUC) (Fig. 7) was used to summarize model performance over all thresholds, and thus over all misclassification error weightings. The XGBoost model produced an AUC of 66% on unbalanced data, whereas after balancing and hyperparameter tuning, prediction on test data produced an AUC of 86%, which indicates a good predictive model. In the figure, the green line shows the model after balancing and tuning, while the orange line shows the model on unbalanced data.

Figure 7

Comparison of XGBoost model prediction on test data.
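AUC has a useful interpretation that explains why it summarizes performance over all thresholds: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes AUC that way on invented labels and scores; it is a pairwise-comparison illustration, not the routine used in the study.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive is scored
    higher than a randomly chosen negative (ties count one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = incomplete immunization) and predicted scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(roc_auc(y_true, scores))
```

In this toy case eight of the nine positive-negative pairs are ranked correctly. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.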

Association rule mining

Association rule mining is a technique used to discover interesting relationships between variables in large datasets53. For this study, association rule mining was done using the Apriori algorithm to identify the precise categories linked with incomplete immunization. Before applying association rule mining, the variables that were not categorical were discretized. Mothers’ age was categorized as (15–24, 25–34, 35–49), number of living children as (1–3, 4–6, and > 6), preceding birth interval as (< 25, 25–48, > 48), ANC follow-up as (no visit, 1–4 and > 4), and birth order as (1st, 2nd and 3rd and above 3rd). The Apriori algorithm produced 13 rules connected to target category 1, which represents incomplete immunization status, with a confidence level above 70%, but only four rules had a confidence level above 85%.

Rule1 IF ('mothers_age_15-24', 'delivery_place_home', 'ANC_follow_1-4') THEN target_1 confidence 89.9% lift 1.696.

Rule2 IF ('mothers_age_15-24', 'ANC_follow_1-4', 'mothers_edu_Nedu', 'delivery_place_home') THEN target_1 confidence 87.4% lift 1.695.

Rule3 IF ('mothers_age_15-24', 'delivery_place_home', 'mothers_edu_Nedu') THEN target_1 confidence 85.58% lift 1.678.

Rule4 IF ('ANC_follow_1-4', 'delivery_place_home', 'mothers_edu_Nedu') THEN target_1 confidence 85.5% lift 1.
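The confidence and lift values attached to each rule follow from simple counts. The sketch below computes support, confidence, and lift for a single given rule on a small set of invented transactions; it does not implement Apriori's candidate generation, and the records are hypothetical, reusing only the item naming style of the rules above.

```python
def rule_stats(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    n = len(transactions)
    antecedent = set(antecedent)
    has_a = [t for t in transactions if antecedent <= t]
    has_both = [t for t in has_a if consequent in t]
    has_c = [t for t in transactions if consequent in t]
    support = len(has_both) / n
    confidence = len(has_both) / len(has_a)
    lift = confidence / (len(has_c) / n)
    return support, confidence, lift


# Hypothetical records; each set holds the categories observed for one child.
transactions = [
    {"delivery_place_home", "mothers_edu_Nedu", "target_1"},
    {"delivery_place_home", "mothers_edu_Nedu", "target_1"},
    {"delivery_place_home", "mothers_edu_Nedu", "target_0"},
    {"delivery_place_home", "target_1"},
    {"delivery_place_facility", "mothers_edu_Nedu", "target_0"},
    {"delivery_place_facility", "target_0"},
    {"delivery_place_facility", "target_0"},
    {"delivery_place_facility", "target_1"},
]
support, confidence, lift = rule_stats(
    transactions, {"delivery_place_home", "mothers_edu_Nedu"}, "target_1"
)
print(support, confidence, lift)
```

Confidence is the share of records matching the IF part that also have the outcome, and lift divides that share by the overall frequency of the outcome. A lift above 1 therefore means the rule's conditions occur together with incomplete immunization more often than expected if they were independent.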


Recently, the COVID-19 pandemic and related disruptions have put a burden on health systems. This resulted in 25 million children missing out on vaccinations in 2021, a number that is 5.9 million higher than in 2019 and the largest since 200910. This study was conducted to predict the top risk factors for incomplete immunization among children under five. Eight supervised machine learning algorithms were trained on both balanced and imbalanced data. The performance of the eight ML models was compared by classification accuracy, f1-score, and Jaccard score. Models trained on SMOTE-balanced data outperformed models developed using unbalanced data in terms of accuracy, f1-score, Jaccard score, and area under the curve. XGBoost and random forest performed best, with similar results on balanced data. After hyperparameter tuning, however, the XGBoost model performed better than random forest, with an accuracy of 79.01%, recall of 89.88%, F1-score of 81.10%, precision of 73.89%, and AUC of 86%. In contrast, random forest was chosen in a study conducted in Sindh province, Pakistan54, on predicting elevated risk of defaulting from immunization. This may be because that study did not test an XGBoost model. The final prediction was made on test data after optimizing the hyperparameters of the XGBoost classifier, which in turn improved the AUC. The model predicted 2567 true positives (true cases of incomplete immunization) and 1935 true negatives (true cases of complete immunization), and misclassified 289 cases as complete and 907 as incomplete.

Accordingly, after the XGBoost model was tuned, the top features were identified by mean SHAP value based on their importance in predicting incomplete immunization. In addition, the contribution of each feature to the prediction of incomplete immunization was identified using the SHAP impact on model output. Features with red dots push the prediction higher, whereas features located at the bottom of the tree explainer plot with blue dots push the prediction lower. This research found that the number of living children, ANC follow-up history, maternal age, place of delivery, birth order, and preceding birth interval were the top factors associated with a higher predicted probability of incomplete immunization among under-five children in East Africa. This result is supported by a previous multilevel analysis in East Africa, which showed that factors such as birth order, ANC follow-up, place of delivery, preceding birth interval, and maternal age have a profound influence on mothers’ health-seeking behavior and child immunization status16.

Another aim of this research was to identify the specific categories associated with incomplete immunization among children under five in East Africa, for which association rule mining was employed. The analysis revealed that children whose mothers had no education, who were delivered at home instead of in a health institution, and whose mothers had 1–4 ANC follow-up visits had a high predicted probability of incomplete immunization. Additionally, the study found that younger maternal age (15–24) was also associated with incomplete immunization.

In our findings, younger mothers (15–24) were associated with incomplete immunization among children under five in East Africa. This may be because older mothers have childcare experience that young mothers have yet to acquire. Additionally, older women may be more willing to continue immunizing their children, since they may previously have had children who received vaccinations without negative side effects. Similarly, studies conducted in Nigeria18, Ethiopia14, and Kenya55 agreed that children of young women (15–24 years) are more likely to be incompletely immunized than children of older women. The consistency of this finding in the present study may be attributable to its large sample and its coverage of areas beyond Ethiopia.

Results from rule generation revealed that children whose mothers had no education were associated with incomplete immunization. A similar association between maternal education and child immunization has been reported in several other studies, including studies from Togo56, Nigeria18, Athens, Greece57, and Hadiya zone, Ethiopia58, and a global systematic review59. Indeed, a woman's education has a demonstrable impact on her ability to acquire information about the use of health services in general and vaccination services in particular, as well as on her standard of living. According to research, education has a significant impact on mothers' health-seeking habits, including child vaccination18. Education also makes it simpler for women to communicate with medical experts, leading to a better understanding of, and ability to absorb, knowledge about actions that enhance children's welfare. In contrast, a study conducted in rural Mozambique showed that mothers' educational levels had no influence on the child's vaccination status60. This may be because that study was conducted only in a rural area, and place of residence affects education-related factors such as media exposure; in addition, its small sample may not have been representative, which could have introduced bias.

Home delivery was associated with incomplete child immunization in East Africa. This finding is in line with studies conducted in Nepal61, India62, Tigray, northern Ethiopia63, Madagascar64, and Kenya55. Similarly, home delivery was reported to be a risk factor in case–control studies65 and a systematic review66 conducted in Ethiopia. A possible explanation is that women who give birth at home are less likely to be aware of their own and their children’s health status than women who deliver in health institutions. According to a systematic review and meta-analysis conducted in Ethiopia, women who gave birth at home were 3 times more likely to have incompletely immunized children than women who delivered at health facilities67.

Our study also shows that the utilization of health services such as ANC can be an important factor in the incompleteness of children’s vaccination status. This is consistent with studies conducted in Tanzania68 and Senegal69 and with a previous multilevel analysis in East Africa16. This might be explained by the fact that mothers obtain helpful information about child immunization at ANC visits, giving them confidence in preventive health care for their children. This result is also supported by a systematic review and meta-analysis from Ethiopia, which found that ANC follow-up services were significantly associated with incomplete vaccination67. Indeed, ANC follow-up is important for child immunization because it allows healthcare providers to monitor the mother's health during pregnancy and to help ensure that the child receives the necessary vaccinations.

Strength and limitation of the study

This study has the following limitations. First, recall and social desirability biases may be present. Although the DHS program is typically regarded as one of the most trustworthy sources of quantitative data, particularly on maternal and child health, the responses may have been affected by recall and social desirability biases. While acknowledging these and other limitations inherent in national demographic surveys of this kind, the surveys still offer the best population-based data currently available, encompassing all provinces and regions of each country and supporting external validity or generalizability.

Nevertheless, this research has several strengths, one of which is the use of machine learning techniques that learn from data rather than relying on prior assumptions, as classical analysis methods do. Furthermore, this study contributes to the literature on immunization status in the context of machine learning.


This study was conducted with the aim of predicting incomplete immunization in East Africa and identifying its predictors. Using mean SHAP values and SHAP plots, we showed that the ML method can illustrate the influence of key features and establish a high-accuracy prediction model for incomplete immunization. The illustration of cumulative domain-specific feature importance and the visualized interpretation of feature importance can allow policy makers and immunization program managers in the respective study areas to intuitively understand the decision-making process for incomplete immunization among under-five children. The number of living children, ANC follow-up history, maternal age, place of delivery, birth order, and preceding birth interval should all be taken into consideration when implementing health policies intended to reduce incomplete immunization. Family planning programs should focus on the number of living children and the preceding birth interval, alongside enhancing mothers’ education in the respective countries. We highly recommend promoting institutional delivery and increasing the number of ANC follow-ups to more than four visits. It is essential that all stakeholders, such as the Eastern Africa Regional Coordination Center (RCC), take appropriate measures to ensure that immunization is accessible to all children in the region.