Machine learning study using 2020 SDHS data to determine poverty determinants in Somalia

Extensive research has been conducted on poverty in developing countries using conventional regression analysis, which has limited prediction capability. This study aims to address this gap by applying advanced machine learning (ML) methods to predict poverty in Somalia. Utilizing data from the first-ever 2020 Somalia Demographic and Health Survey (SDHS), a cross-sectional study design is considered. ML methods, including random forest (RF), decision tree (DT), support vector machine (SVM), and logistic regression, are tested and applied using R software version 4.1.2, while conventional methods are analyzed using STATA version 17. Evaluation metrics, such as confusion matrix, accuracy, precision, sensitivity, specificity, recall, F1 score, and area under the receiver operating characteristic (AUROC), are employed to assess the performance of predictive models. The prevalence of poverty in Somalia is notable, with approximately seven out of ten Somalis living in poverty, making it one of the highest rates in the region. Among nomadic pastoralists, agro-pastoralists, and internally displaced persons (IDPs), the poverty average stands at 69%, while urban areas have a lower poverty rate of 60%. The accuracy of prediction ranged between 67.21% and 98.36% for the advanced ML methods, with the RF model demonstrating the best performance. The results reveal geographical region, household size, respondent age group, husband employment status, age of household head, and place of residence as the top six predictors of poverty in Somalia. The findings highlight the potential of ML methods to predict poverty and uncover hidden information that traditional statistical methods cannot detect, with the RF model identified as the best classifier for predicting poverty in Somalia.


Background of the study
Poverty reduction has become a crucial global mission, particularly for countries facing significant economic challenges 1. The United Nations (UN) has outlined 17 Sustainable Development Goals (SDGs) for 2015-2030,

Motivation
The motivation behind this study comes from the urgent need to address the persistent and widespread issue of poverty in Somalia. Despite various efforts and initiatives, poverty remains a significant challenge, affecting a large proportion of the population and hindering the country's overall development. The Sustainable Development Goals (SDGs) provide a comprehensive framework to tackle poverty and promote sustainable development globally. However, it is essential to assess the progress made by individual countries, such as Somalia, towards achieving these goals.
On the other hand, ML methods have emerged as powerful tools for predicting and understanding complex phenomena. While they have been successfully applied to poverty analysis in other developing countries, there is a lack of research utilizing these methods specifically in the context of Somalia. By applying ML algorithms to predict poverty in Somalia, we can leverage the predictive capabilities of these advanced techniques and gain insights into the key factors driving poverty in the country.

Novelty of the study
To the best of our knowledge, this study represents the first attempt to apply ML methods to predict poverty in Somalia. While previous research has explored this subject in the country using conventional regression analysis, the utilization of advanced ML techniques provides a novel approach to understanding and predicting poverty patterns 29,35. By employing ML algorithms such as RF, DT, SVM, and logistic regression, we can uncover hidden information and complex relationships between predictors and poverty outcomes that may not be captured by traditional statistical methods. Furthermore, our study extends the analysis to evaluate the country's progress towards achieving the Sustainable Development Goal related to poverty reduction. By examining the specific predictors of poverty and evaluating the performance of different predictive models, our research contributes to the existing literature on poverty analysis in Somalia. It also provides valuable insights for policymakers and stakeholders in their efforts to address poverty and promote sustainable development in the country.

Outline of the paper
The remaining sections of this paper are structured as follows. The "Materials and methods" section provides a detailed description of the methodology employed in this study, including the data collection and preprocessing procedures, feature selection techniques, model development, and evaluation of model performance. The results of our analysis are described in the "Results" section. The "Discussion" section focuses on the key findings, their interpretation, and their implications in the context of poverty reduction in Somalia. The "Conclusion" section summarizes the implications of our research and proposes avenues for future studies aimed at addressing poverty reduction in Somalia. The "Limitations of the study" section discusses the limitations of the study. Finally, the "Policy implications" section presents the policy implications of the study.

Sample in the study
To enhance the quality of life for the Somali population, an extensive survey was conducted across two phases of the SDHS, covering a total of 100,000 households. The survey specifically aimed to capture the perspectives of nomadic communities as well as individuals residing in urban and rural areas, with a focus on understanding their unique needs and challenges. This study includes comprehensive information from 32,298 households, providing a robust dataset for analysis and informing targeted interventions to improve the overall well-being of Somalis.

Outcome variable
The outcome variable is the wealth index, denoted as WI, which serves as a comprehensive measure of a household's overall wealth. A detailed explanation of its construction can be found in the supplementary information (available at the Somalia National Bureau of Statistics website). The original index consists of five categories: "poorest," "poorer," "middle," "richer," and "richest." In our study, we redefined the dependent variable as a binary variable, represented by Y. Specifically, the categories "poorest" and "poorer" were assigned a value of 1, indicating relative poverty within the household. On the other hand, the categories "middle," "richer," and "richest" were grouped together and assigned a value of 0, indicating relative affluence within the household.
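The recoding of the wealth index into the binary outcome Y can be sketched as follows. This is an illustrative Python sketch rather than the authors' R code; the function name `to_binary_poverty` and the lowercase label spellings are assumptions made for the example and may differ from the coding used in the SDHS files.

```python
# Illustrative recoding of the five-category wealth index (WI) into the binary outcome Y.
# The label spellings and the function name are assumptions, not taken from the SDHS files.
POOR_CATEGORIES = {"poorest", "poorer"}
NON_POOR_CATEGORIES = {"middle", "richer", "richest"}


def to_binary_poverty(wealth_index: str) -> int:
    """Return 1 for relatively poor households and 0 for relatively affluent ones."""
    label = wealth_index.strip().lower()
    if label in POOR_CATEGORIES:
        return 1
    if label in NON_POOR_CATEGORIES:
        return 0
    # Fail loudly on unexpected labels instead of silently assigning a class.
    raise ValueError(f"Unknown wealth index category: {wealth_index!r}")


# Hypothetical wealth index values for five households.
sample = ["poorest", "middle", "Richer", "poorer", "richest"]
y = [to_binary_poverty(w) for w in sample]
```

Raising an error on unknown labels is a design choice for the sketch: it keeps miscoded or missing categories from being counted as either poor or non-poor.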

Predictor variables
After removing the variables used as components in constructing the wealth index (the dependent variable), we preserve the remaining variables that could potentially have associations with poverty. These variables encompass the administrative region, household size, age group of respondents, employment status of the husband, age of the household head, place of residence, education level of the husband, exposure to mass media, sex of the household head, source of drinking water, maternal education, type of toilet facility, and employment status of the mother.

Data preprocessing
The raw data obtained from the SDHS underwent preprocessing to ensure its suitability for analysis. This included data cleaning, missing value imputation, and variable transformation if necessary.

Data cleaning
Data cleaning is a crucial step in ensuring the quality and reliability of the dataset. In this study, a thorough data cleaning process was conducted to identify and rectify errors, inconsistencies, and outliers. Duplicate entries were removed, formatting issues were corrected, and data entry errors were addressed. The data cleaning process resulted in a cleaned dataset that formed the basis for subsequent analyses, as shown in Figure 1.

Missing values imputation
Missing value imputation is a critical step in data preprocessing to address the issue of missing data. This imputation process was carried out iteratively until 100% completeness of all variables was achieved. This rigorous approach aimed to minimize the impact of missing data on subsequent analyses and ensure the reliability of the results. Thus, the dataset became more robust, allowing for more accurate analysis and interpretation.
In summary, missing value imputation was performed using appropriate techniques to handle missing data, resulting in a more complete dataset suitable for further analysis, as shown again in Figure 1.

Feature selection
Feature selection techniques were then employed to identify the most relevant predictors of poverty. This involved analyzing the correlation between variables, conducting exploratory data analysis, and applying statistical tests to determine the significance and predictive power of each variable. Our methodology involved several steps to ensure a thorough analysis of the data.
First, we produced a correlation plot to examine the relationships between variables. This helped us identify potential associations and dependencies among the predictors. Additionally, we performed descriptive analysis to gain insights into the distribution and summary statistics of the variables.
Next, we applied inferential analysis using classical regression models, specifically logistic regression. In this way, we assessed the significance and predictive capability of each variable in relation to poverty. This allowed us to understand the individual contributions of the predictors and identify statistically significant relationships.
In addition to classical regression models, we implemented four ML algorithms: RF, DT, SVM, and logistic regression. These algorithms provided a more robust and comprehensive analysis of the data, capturing complex relationships and non-linear interactions.
After evaluating the performance of the different ML algorithms, we selected RF as our final approach for feature selection. We recall that RF is an ensemble learning method that combines multiple decision trees to estimate feature importance. By leveraging the collective predictive power of these trees, RF assigns importance scores to each variable, enabling us to rank and select the most influential predictors of poverty.
By incorporating correlation analysis, descriptive analysis, inferential analysis using classical regression models, and four ML algorithms, we employed a rigorous and multi-faceted approach to feature selection. This comprehensive methodology ensured that we identified the most relevant predictors of poverty, enhancing the accuracy and interpretability of our subsequent analyses.

Machine learning methods
To predict poverty in Somalia, several ML algorithms were employed, including the RF, DT, SVM, and logistic regression. These algorithms were implemented using the R software version 4.1.2. The ML models were trained on the preprocessed dataset, with poverty status as the target variable and a set of selected predictor variables.
For the RF algorithm, we utilized the randomForest package in R. We conducted parameter tuning by adjusting the number of trees, maximum depth, minimum node size, and other relevant parameters to optimize the RF model's performance.
Similarly, for the DT algorithm, we utilized the rpart package in R. We conducted parameter tuning by varying the maximum depth, minimum split, and other relevant parameters to identify the optimal settings for the DT model.
For the SVM algorithm, we employed the e1071 package in R. We conducted a grid search combined with cross-validation to determine the optimal values for the hyperparameters, such as the kernel type, cost, and gamma.
Regarding the logistic regression, we utilized the glm function in R. Parameter tuning for logistic regression involved adjusting the regularization parameter and other relevant parameters to optimize the model's performance.

Model evaluation
The performance of the predictive models was assessed using various evaluation metrics. These included the confusion matrix, accuracy, precision, sensitivity, specificity, recall, F1 score, and the AUROC curve. The confusion matrix provided a comprehensive overview of the model's predictive performance, while accuracy measured the overall correctness of the predictions. Precision, sensitivity, and specificity provided insights into the model's ability to correctly identify positive and negative instances. The AUROC curve indicated the model's discrimination power between positive and negative instances.
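For reference, the threshold-based metrics listed above all follow from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions in Python; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the study's results.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard metrics from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)  # also called recall: share of actual positives found
    specificity = tn / (tn + fp)  # share of actual negatives correctly identified
    precision = tp / (tp + fp)    # share of predicted positives that are correct
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts, with "poor" treated as the positive class.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

Note that sensitivity and recall are two names for the same quantity, which is why they take identical values wherever both are reported.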
In addition, cross-validation is a widely used method for assessing the generalizability of ML models. It helps mitigate issues such as overfitting by providing a more robust estimation of the model's performance on unseen data. Specifically, we utilized k-fold cross-validation, where the dataset is divided into k subsets, or folds. The models were trained on (k − 1) folds and evaluated on the remaining fold. This process was repeated k times, rotating the evaluation fold each time. The performance metrics reported in our evaluation reflect the average performance across all folds, ensuring a more reliable and unbiased estimation of the models' predictive capabilities 4.
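The k-fold procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `k_fold_indices` and `cross_validated_score`, the random seed, and the choice of k = 5 in the usage lines are hypothetical, since the text does not report the value of k.

```python
import random


def k_fold_indices(n: int, k: int, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are random but reproducible
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validated_score(n: int, k: int, fit_and_score, seed: int = 0) -> float:
    """Average a user-supplied score over the k held-out folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n, k, seed)]
    return sum(scores) / len(scores)


# Usage with a placeholder scorer that simply returns the size of the held-out fold.
splits = list(k_fold_indices(10, 5))
mean_fold_size = cross_validated_score(10, 5, lambda train, test: len(test))
```

In practice `fit_and_score` would train a model on the training indices and return a metric such as accuracy on the held-out indices; each observation is used for evaluation exactly once.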

Ethical considerations
Participants in this study cannot be identified, since all personally identifiable information has been removed from the dataset. Permission to utilize the data was obtained from the Somalia National Bureau of Statistics. Hence, obtaining additional ethical approval may not be necessary.

Statistical analysis
In addition to the ML methods, conventional statistical analysis was performed using STATA version 17. Descriptive statistics were computed to summarize the characteristics of the sample population. A regression analysis was conducted to examine the relationship between predictor variables and poverty outcomes, utilizing appropriate statistical tests and controlling for potential confounding factors.
Logistic regression analysis was employed to model the binary nature of the outcome variable. It is a widely used statistical technique specifically designed for binary outcomes. By utilizing logistic regression, we can effectively analyze the relationship between the independent variables and the binary outcome variable.
To ensure the validity of our logistic regression models, we considered the assumptions specific to this analysis. We assessed the assumption of linearity in the logit by examining the relationship between the log odds of the outcome and the independent variables. Additionally, we verified the assumption of independence of observations, which assumes that the observations are not correlated with each other.
In order to account for potential confounding factors, we carefully selected relevant variables based on prior knowledge and existing literature. These confounding factors were included as independent variables in the logistic regression models to control for their influence on the outcome variable. By considering these factors, we aim to accurately estimate the relationship between the independent variables and the binary outcome while mitigating the impact of potential confounding effects.
We examined several categorical explanatory variables to better understand the determinants of poverty in the country, as shown in Table 1. Among the regions, the highest number of respondents (3111; 9.632%) were from Banadir, followed by Sanag (2893; 8.957%). These regional distributions will help us understand the geographical representation of our sample and its implications for poverty rates.
Residence is another important factor. We observed that a significant proportion of respondents lived in urban areas (12,410; 38.5%), while a considerable number resided in nomadic settings (11,117; 34.42%). This will enable us to explore the differences in poverty prevalence between urban and nomadic populations and their potential impact on poverty-related variables.
Access to basic amenities is crucial in understanding poverty dynamics. A majority of respondents had improved water sources (19,039; 58.947%), whereas only a minority had improved toilet facilities (12,554; 38.869%); the remainder relied on unimproved water sources (41.05%) and unimproved toilet facilities (61.13%). These findings will contribute to our analysis of the relationship between access to basic services and poverty levels in Somalia.
Educational attainment plays a significant role in poverty reduction. The majority of mothers had no formal education (28,120; 87.06%), while a smaller percentage completed primary education (3215; 9.954%). In our study, we will examine the influence of maternal education on poverty outcomes.
Gender dynamics are also important to consider. The household head was predominantly male (21,431; 66.35%), while female heads accounted for 10,867 (33.64%). We will explore the potential role of gender in poverty determination and its interaction with other variables.
Moreover, employment patterns among mothers and husbands were examined. Only a small proportion of mothers were employed (366; 1.133%), while the majority were unemployed (31,932; 98.866%). Among husbands, 43.299% were employed (13,985), while 56.70% were unemployed (18,313). These employment figures will help us understand the relationship between household income and poverty status.
By considering these categorical explanatory variables from the SDHS 2020 dataset, our study aims to employ ML algorithms to identify the key determinants of poverty in Somalia. We will compare the predictive performance of the ML models with that of classical models, aiming to provide valuable insights into poverty dynamics and contribute to poverty alleviation efforts in the country.

Descriptive statistics of the continuous predictor variables
The descriptive statistics for the continuous variables are presented in Table 2. The minimum number of household members was 0, while the maximum was 9. The average household size was 5.3, with a standard deviation of 2.17. The age of the household head ranged from 15 years to 49 years, with a mean of 38.178 years and a standard deviation of 21.80.

Correlation analysis
A correlation analysis was conducted to examine the relationships between variables in our feature selection process. The correlation coefficients were computed and visualized in a correlation plot, which provided a comprehensive overview of the pairwise correlations. By analyzing the plot, we identified variables with strong positive or negative correlations and considered potential multicollinearity issues. This analysis served as a valuable initial step, guiding our subsequent feature selection by highlighting the most relevant predictors of poverty. The findings from the correlation analysis were summarized in the correlation plot shown in Figure 2, enabling us to make informed decisions and enhance the overall effectiveness of our feature selection methodology.

Inferential analysis
The logistic regression analysis aimed to identify key factors associated with poverty in Somalia using the first-ever SDHS dataset from 2020. Several variables were examined, including age group, maternal education, household size, age of the household head, sex of the household head, maternal employment status, husband employment status, husband education, region, place of residence, water source, and toilet facility.
The results revealed significant associations between certain variables and the odds of poverty, as presented in Table 3. Individuals with primary education had nearly three times higher odds of poverty (OR = 2.95, 95% CI [1.59, 5.50]), while those with secondary education exhibited markedly higher odds (OR = 26.60, 95% CI [11.24, 63.09]) compared to individuals with no education. Similarly, individuals with higher education had 8.49 times higher odds of poverty (OR = 8.49, 95% CI [1.57, 45.86]).
Regarding husband-related variables, households with unemployed husbands demonstrated substantially higher odds of poverty (OR = 3.07, 95% CI [1.78, 5.31]) compared to households with employed husbands. However, husbands' education did not exhibit significant associations with poverty.
Geographically, the analysis revealed variations in the odds of poverty across different regions of Somalia. Each region had its own odds ratio, indicating the likelihood of poverty in that specific region compared to the reference region (Awdal). However, the interpretation of these regional odds ratios requires further context and examination.
In terms of place of residence, individuals living in urban areas had slightly lower odds of poverty (OR = 0.89, 95% CI [0.77, 1.03]) compared to those in rural areas, although this difference was not statistically significant because the confidence interval includes 1. Individuals residing in nomadic areas had 1.60 times higher odds of poverty (OR = 1.60, 95% CI [1.11, 2.32]).
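The relationship between a logistic regression coefficient, its odds ratio, and the 95% confidence interval can be illustrated with a short Python sketch. It assumes Wald-type intervals, OR = exp(β) with bounds exp(β ± 1.96·SE); the text does not state how the intervals were computed, so this is an assumption, and the function names are hypothetical. The usage lines back out β and SE from the reported estimate for primary education (OR = 2.95, 95% CI [1.59, 5.50]) and rebuild the interval.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def odds_ratio_ci(beta: float, se: float):
    """Return (OR, lower, upper) for a coefficient, assuming a Wald interval."""
    return (
        math.exp(beta),
        math.exp(beta - Z_95 * se),
        math.exp(beta + Z_95 * se),
    )


def beta_and_se_from_ci(odds_ratio: float, lower: float, upper: float):
    """Recover the log-odds coefficient and its standard error from a reported OR and CI."""
    beta = math.log(odds_ratio)
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    return beta, se


# Reported estimate for primary education versus no education.
beta, se = beta_and_se_from_ci(2.95, 1.59, 5.50)
or_hat, ci_low, ci_high = odds_ratio_ci(beta, se)
```

Because the interval is symmetric on the log-odds scale, the rebuilt bounds agree with the reported ones only up to rounding. An interval that contains 1 corresponds to a coefficient whose interval contains 0 on the log scale.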
In summary, these findings highlight the multifaceted nature of poverty in Somalia and underscore the importance of addressing factors such as education, gender dynamics, and access to basic amenities in efforts to alleviate poverty levels in the country.

Machine learning models performance and predicting poverty
Table 4 provides various evaluation metrics to assess the performance of each predictive model in predicting poverty. These metrics include accuracy, recall, sensitivity, specificity, positive predictive value, negative predictive value, precision, F1 score, prevalence, detection rate, detection prevalence, and balanced accuracy.
Among the models, RF achieved the highest accuracy at 96.38%, followed by the logistic regression at 74.95%, DT at 73.73%, and SVM at 67.21%. The RF also demonstrated the highest recall (95.90%) and sensitivity (95.90%), indicating its ability to correctly identify the majority of individuals experiencing poverty. The logistic regression had a recall of 70.49%, while the DT and SVM had lower recall values of 66.44% and 62.49%, respectively.
In terms of specificity, the RF performed the best at 96.80%, followed by the DT at 86.17%, logistic regression at 80.09%, and SVM at 73.45%. These values indicate the models' ability to correctly identify non-poor individuals.
The positive predictive value (also known as precision) measures the proportion of correctly predicted poor individuals among all predicted poor cases. The RF achieved the highest positive predictive value at 96.41%, followed by the DT at 89.13%, logistic regression at 80.34%, and SVM at 75.66%. The negative predictive value measures the proportion of correctly predicted non-poor individuals among all predicted non-poor cases. The RF had the highest negative predictive value at 96.35%, followed by the logistic regression at 70.17%, DT at 60.06%, and SVM at 59.71%.
The F1 score, which balances precision and recall, was highest for the RF at 96.16%, followed by the DT at 76.13%, logistic regression at 75.09%, and SVM at 68.44%. These scores summarize how well each model trades off precision against recall for the poor class. The prevalence indicates the proportion of individuals experiencing poverty in the dataset. The reported prevalence was 63.06% for the DT, 56.91% for the SVM, 53.57% for the logistic regression, and 47.25% for the RF.
The detection rate measures the proportion of all individuals in the dataset who were correctly predicted as poor, that is, true positives divided by the total number of cases. The RF achieved the highest detection rate at 45.32%, followed by the DT at 41.90%, logistic regression at 37.76%, and SVM at 35.56%. The detection prevalence represents the proportion of predicted poor individuals among all individuals in the dataset. All models had a detection prevalence of 47.00%.
Finally, balanced accuracy provides an average of sensitivity and specificity, giving equal weight to both classes. The RF had the highest balanced accuracy at 96.35%, followed by the DT at 76.30%, logistic regression at 75.29%, and SVM at 67.97%. In summary, the RF outperformed the other models in terms of accuracy, recall, sensitivity, specificity, positive predictive value, negative predictive value, precision, F1 score, and balanced accuracy. However, it is important to consider other factors, such as model complexity, interpretability, and computational requirements, when choosing the most appropriate predictive model for a specific context. Additionally, in Figure 3, we present a comprehensive comparison of model performance metrics for ML models. The metrics evaluated include accuracy, F1 score, precision, area under the receiver operating characteristic curve (AUROC), and sensitivity. This figure provides a visual summary of the performance of each algorithm, allowing for a quick and insightful comparison of their predictive capabilities.
Figure 4 illustrates the ROC curves obtained in this study. Among the four ML models utilized, the ROC curve of the RF model exhibits the highest area under the curve (AUC) value. This signifies that the RF model outperforms the other models in accurately classifying cases as either poor or well-off.
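Several of the reported metrics are functions of the others, which allows a quick consistency check on the values quoted above. The Python sketch below recomputes accuracy, F1 score, balanced accuracy, and detection rate from the reported sensitivity, specificity, precision, and prevalence. The formulas are the standard ones (detection rate taken as sensitivity times prevalence, which is an assumption about the convention used), and the dictionary and function names are illustrative.

```python
# Values in percent, as quoted in the text for Table 4.
REPORTED = {
    "RF": {"accuracy": 96.38, "sensitivity": 95.90, "specificity": 96.80, "precision": 96.41,
           "prevalence": 47.25, "f1": 96.16, "balanced_accuracy": 96.35, "detection_rate": 45.32},
    "DT": {"accuracy": 73.73, "sensitivity": 66.44, "specificity": 86.17, "precision": 89.13,
           "prevalence": 63.06, "f1": 76.13, "balanced_accuracy": 76.30, "detection_rate": 41.90},
    "LR": {"accuracy": 74.95, "sensitivity": 70.49, "specificity": 80.09, "precision": 80.34,
           "prevalence": 53.57, "f1": 75.09, "balanced_accuracy": 75.29, "detection_rate": 37.76},
    "SVM": {"accuracy": 67.21, "sensitivity": 62.49, "specificity": 73.45, "precision": 75.66,
            "prevalence": 56.91, "f1": 68.44, "balanced_accuracy": 67.97, "detection_rate": 35.56},
}


def derive(m: dict) -> dict:
    """Recompute dependent metrics (in percent) from sensitivity, specificity, precision, prevalence."""
    sens, spec = m["sensitivity"], m["specificity"]
    prec, prev = m["precision"], m["prevalence"]
    return {
        "accuracy": (sens * prev + spec * (100 - prev)) / 100,
        "f1": 2 * prec * sens / (prec + sens),
        "balanced_accuracy": (sens + spec) / 2,
        "detection_rate": sens * prev / 100,  # true positives as a share of all cases
    }


def max_discrepancy(reported: dict) -> float:
    """Largest absolute gap (percentage points) between reported and recomputed values."""
    worst = 0.0
    for m in reported.values():
        for name, value in derive(m).items():
            worst = max(worst, abs(value - m[name]))
    return worst


worst_gap = max_discrepancy(REPORTED)
```

Under these definitions the recomputed values match the quoted ones to within rounding, which supports reading the detection rate as a share of all cases rather than of actual poor cases.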

Important features selection
In this study, we investigated the feature selection process using four popular ML algorithms: RF, DT, SVM, and logistic regression. Each algorithm was utilized to assess the importance and relevance of features in the dataset. By comparing the results of these four models, we aimed to identify the most informative features for our analysis. The feature selection process plays a crucial role in enhancing the performance and interpretability of ML models. Through this investigation, we aimed to gain insights into the relative strengths and limitations of each algorithm in terms of feature selection. This knowledge will contribute to making informed decisions regarding the inclusion or exclusion of features in subsequent analyses and modeling tasks. All the feature selections are summarized in Figs. 5, 6, 7, and 8.
After evaluating the feature selection results obtained from the RF, DT, SVM, and logistic regression, we chose to prioritize the RF for several reasons. The RF demonstrated superior performance across multiple evaluation metrics, including accuracy, sensitivity, AUROC, F1 score, precision, and other relevant metrics. Its ability to handle high-dimensional data, capture complex interactions, and be robust to outliers made it a compelling choice. Additionally, RF's built-in feature importance calculation based on metrics like Gini impurity or mean decrease in accuracy provided valuable insights into the relevance and significance of features. The overall combination of its excellent performance and comprehensive feature importance analysis solidified our decision to consider RF as the primary feature selection method for our study.
Thus, in our statistical context, we employed an RF classifier (Fig. 5) to identify significant features associated with poverty. The analysis revealed a set of 13 key features that contribute to poverty, including administrative region, household size, age group of respondents, employment status of the husband, age of the household head, place of residence, education level of the husband, exposure to mass media, sex of the household head, source of drinking water, maternal education, type of toilet facility, and employment status of the mother. These findings highlight the complex nature of poverty and underscore the importance of considering various socio-economic factors when addressing poverty-related issues. The insights gained from this feature selection process, combined with RF's excellent performance and comprehensive feature importance analysis, provide a solid foundation for further analysis and the development of targeted interventions to alleviate poverty in the studied population.
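To make the Gini-based importance mentioned above more concrete, the following Python sketch computes the Gini impurity of a node and the impurity decrease produced by a single split. In a random forest, a feature's Gini importance is, roughly, this decrease accumulated over every split that uses the feature and averaged across trees. The function names and toy labels are hypothetical, and the sketch does not reproduce the internals of any particular R package.

```python
from collections import Counter


def gini_impurity(labels) -> float:
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(parent, left, right) -> float:
    """Impurity decrease achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_child_impurity = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_child_impurity


# Toy node with three poor (1) and three non-poor (0) households.
parent = [1, 1, 1, 0, 0, 0]
perfect_split = gini_decrease(parent, [1, 1, 1], [0, 0, 0])
uninformative_split = gini_decrease(parent, [1, 0], [1, 1, 0, 0])
```

A split that separates the classes completely removes all of the parent's impurity, whereas a split whose children have the same class mix as the parent yields no decrease and therefore contributes nothing to the feature's importance.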

Discussion
The logistic regression analysis aimed to identify key factors associated with poverty in Somalia using the first-ever SDHS dataset from 2020. Our results revealed significant associations between certain variables and the odds of poverty. Individuals with primary education had nearly three times higher odds of poverty, while those with secondary education exhibited significantly higher odds. Female-headed households had lower odds of poverty compared to male-headed households. Unemployed husbands and households with unimproved water sources
In comparing our study findings to previous research, it is important to note that limited studies have specifically focused on predicting poverty in Somalia using advanced ML techniques. However, our results align with previous studies conducted in similar contexts.
Regarding the association between education and poverty, our results are consistent with prior studies that have identified education as a significant determinant of poverty in developing countries 36. Individuals with higher levels of education generally have better employment prospects and income-earning opportunities, reducing their likelihood of experiencing poverty.
The finding that female-headed households exhibit lower odds of poverty aligns with existing literature on gender and poverty 37,38. Female-headed households often face additional challenges, such as limited access to resources and economic opportunities. However, our results suggest that these households may have developed strategies for resilience and economic empowerment, leading to lower poverty rates compared to male-headed households.
The association between household characteristics (such as household size and age of the household head) and poverty found in our study is consistent with previous research that highlights the complex interplay between household dynamics and poverty 23,39. While we did not find statistically significant associations for these variables, their potential influence on poverty cannot be overlooked, and further research could explore their nuanced effects.
The regional disparities in poverty rates identified in our study align with earlier research that has documented spatial variations in poverty within countries 40,41. This suggests the need for targeted regional policies and interventions to address localized poverty challenges and promote equitable development.
The association between access to basic amenities (water sources and toilet facilities) and poverty is consistent with studies emphasizing the importance of infrastructure and sanitation in poverty reduction 37. Lack of access to improved water and sanitation facilities can exacerbate health and economic vulnerabilities, contributing to higher poverty rates.
While our study contributes to the understanding of poverty determinants in Somalia, further research is needed to expand upon these findings and compare them with a broader range of studies examining poverty dynamics in similar contexts.
We also aimed to apply ML algorithms to identify the key determinants of poverty in Somalia using the firstever SDHS 2020 dataset.The performance of various predictive models, including logistic regression, RF, DT, and SVM, was evaluated and compared.The findings provide insights into the effectiveness of these models in predicting poverty and offer implications for poverty alleviation strategies in Somalia.
The RF model emerged as the top-performing model in this analysis.It achieved the highest accuracy (96.38%), indicating its ability to make correct predictions for a significant portion of the dataset.The model also demonstrated high recall (95.90%), specificity (96.80%), precision (96.41%), and F1 score (96.16%), indicating its strong performance in identifying both the "Poor" and "Well" classes.These results suggest that the RF model is well-suited for identifying poverty determinants in Somalia and can potentially contribute to targeted interventions and poverty reduction efforts.
The logistic regression model also showed promise, although its performance was slightly lower than that of the RF model.With an accuracy of 74.95% and a recall of 70.49%, the logistic regression model displayed a reasonably good capacity for classifying instances.However, its specificity (80.09%) and precision (80.34%) were relatively lower, indicating a higher number of false positives.Nevertheless, the logistic regression model can still provide valuable insights into poverty determinants and contribute to an understanding of the factors driving poverty in Somalia.The DT model exhibited competitive performance, with an accuracy of 73.73% and a recall of 66.44%.It demonstrated relatively high specificity (86.17%) and precision (89.13%), showcasing its ability to effectively identify the "Well" class.However, the model had a lower recall and F1 score compared to the RF model, implying a higher number of false negatives.Despite this limitation, the DT model can still offer valuable insights into the key determinants of poverty in Somalia.
In contrast, the SVM model demonstrated the lowest overall performance among the evaluated models. With an accuracy of 67.21% and a recall of 62.49%, the SVM model struggled to classify instances accurately. Its specificity (73.45%), precision (75.66%), and F1 score (68.44%) were also lower than those of the other models. While the SVM model may have limitations in predicting poverty determinants in Somalia, it can still contribute to the overall understanding of the problem and provide additional perspectives.
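The F1 scores quoted above can be cross-checked against the reported precision and recall, since F1 is the harmonic mean of the two. The following is a minimal Python sketch of that arithmetic, not the authors' R code; the function name `f1_score` and the dictionary layout are illustrative assumptions, and the inputs are simply the rounded percentages reported in this discussion, so small rounding differences from the published F1 values are expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the discussion, converted to fractions.
# The model labels are only dictionary keys chosen for this illustration.
reported = {
    "RF": (0.9641, 0.9590),
    "Logistic regression": (0.8034, 0.7049),
    "DT": (0.8913, 0.6644),
    "SVM": (0.7566, 0.6249),
}

# F1 per model, expressed as a percentage for comparison with the text.
f1_percent = {name: 100 * f1_score(p, r) for name, (p, r) in reported.items()}

for name, value in f1_percent.items():
    print(f"{name}: F1 = {value:.2f}%")
```

Recomputed this way, the RF and SVM values agree with the reported F1 scores (96.16% and 68.44%) to within rounding of the inputs.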
It is important to consider the prevalence of positive instances in the dataset when interpreting the results. The DT model had the highest prevalence value (63.06%), indicating a higher proportion of instances belonging to the "Well" class. The RF model, on the other hand, had the lowest prevalence value (47.25%), suggesting a more balanced distribution of the two classes. These prevalence values have implications for the generalizability of the findings and should be taken into account when designing targeted poverty reduction interventions.
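One way to see why prevalence matters is that accuracy and precision are both determined by sensitivity, specificity, and prevalence. The sketch below, again in Python rather than the R used in the study, recomputes accuracy and precision for the RF and DT models from the reported rates. It assumes that the reported prevalence is the share of the positive class in the evaluation data and that recall is the same quantity as sensitivity; the function names are illustrative and do not come from the paper.

```python
def accuracy_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Accuracy as a prevalence-weighted average of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1 - prevalence)


def precision_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Precision (positive predictive value) from the same three rates."""
    true_pos = sensitivity * prevalence                # share of all cases that are true positives
    false_pos = (1 - specificity) * (1 - prevalence)   # share of all cases that are false positives
    return true_pos / (true_pos + false_pos)


# Reported values from the discussion, as fractions: (sensitivity, specificity, prevalence).
rates = {
    "RF": (0.9590, 0.9680, 0.4725),
    "DT": (0.6644, 0.8617, 0.6306),
}

# Recomputed (accuracy, precision) per model, as percentages.
recomputed = {
    name: (100 * accuracy_from_rates(*r), 100 * precision_from_rates(*r))
    for name, r in rates.items()
}

for name, (acc, prec) in recomputed.items():
    print(f"{name}: accuracy = {acc:.2f}%, precision = {prec:.2f}%")
```

Under these assumptions the recomputed figures match the reported accuracy (96.38% for RF, 73.73% for DT) and precision (96.41% and 89.13%) to within rounding, which illustrates how the same sensitivity and specificity would yield different accuracy and precision in a population with a different class balance.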
Overall, the results of this study highlight the potential of ML algorithms, particularly the RF model, in identifying the key determinants of poverty in Somalia. The findings can inform policymakers and stakeholders involved in poverty alleviation efforts, providing them with valuable insights into the factors driving poverty and enabling them to develop more effective strategies. However, it is crucial to acknowledge the limitations of the study, such as the reliance on a single dataset and the need for further research to validate and expand upon these findings. Future studies can explore additional variables, consider alternative models, and incorporate external data sources to enhance the accuracy and robustness of poverty prediction models in Somalia.
In addition, future research could incorporate other factors to enhance the understanding of poverty determinants in Somalia. These may include household assets, access to electricity and clean energy, food security and nutrition status, geographical location and proximity to essential services, gender and household composition, quality of housing and infrastructure, social and cultural factors influencing poverty, access to financial services and credit opportunities, exposure to conflict and violence, and government policies and interventions related to poverty alleviation. By examining these variables, future studies can gain a more comprehensive perspective on the complex dynamics of poverty in Somalia. This deeper understanding can contribute to the development of targeted interventions and policies aimed at addressing poverty effectively and improving the overall well-being of the Somali population.

Conclusion
In conclusion, this study used logistic regression analysis to identify the significant determinants of poverty in Somalia. The findings highlight the crucial roles played by various factors, including age group, maternal education, household size, age and sex of the household head, maternal and husband employment status, husband education, region, place of residence, water source, and toilet facility, in shaping poverty outcomes. These insights offer valuable guidance to policymakers and stakeholders in designing targeted interventions and policies aimed at reducing poverty and fostering inclusive socioeconomic development in the Somali context.
Moreover, the study evaluated the performance of four predictive models: logistic regression, RF, DT, and SVM. Based on the confusion-matrix results, the RF model exhibited the highest accuracy (96.38%) and specificity (96.80%) among the evaluated models, surpassing the others in predicting both poor and well-off outcomes. However, it is essential to consider interpretability and computational complexity when selecting the most suitable model for practical implementation.
To further enhance the understanding and application of these models, future research should explore the causal relationships between the identified determinants and poverty outcomes in Somalia. Additionally, efforts should be directed towards refining the predictive capabilities and overall performance of the models to address the specific needs and complexities of the Somali context within the field of management science.

Limitations of the study
It is important to acknowledge several limitations when interpreting the findings of this study. Firstly, the dataset used in the analysis is from 2020, collected before the COVID-19 pandemic. As a result, the data may not fully capture the current situation of households, particularly the impact of the pandemic on poverty. Future research should consider incorporating more recent data to validate and update the empirical results. The potential effects of the pandemic reinforce the need to interpret the study's findings in light of the poverty dynamics that have evolved since the crisis.
Secondly, while the RF model was recommended by UN researchers for poverty prediction, different models may be more suitable for different situations. Therefore, the RF model may not necessarily be the best choice for poverty prediction in all contexts. Alternative models should be explored and compared to determine the most appropriate approach for poverty analysis.
Thirdly, while improving poverty prediction is crucial for poverty reduction and enhancing quality of life, identifying households at high risk of poverty is just one step in addressing the issue. The practical impact of poverty reduction relies on the implementation of targeted policies adapted to specific contexts, which may vary from one place to another. Therefore, the findings of this study should be considered in conjunction with the development and implementation of effective poverty reduction strategies.

Figure 4. AUROC curves for the four competitive ML models.

Figure 5. Important features selected for the SVM model.

Figure 6. Important features selected for the DT model.

Figure 7. Important features selected for the logistic regression model.

Figure 8. Important features selected for the RF model.

unimproved toilet facilities (OR = 1.24, 95% CI [1.04, 1.48]) exhibited increased odds of poverty compared to households with improved water sources and toilet facilities, respectively.
Table 1. Frequency distribution of categorical variables.
Correlation plot for the variables of the study.

Table 3. Logistic regression model analysis of key determinants associated with poverty. Significant values are in bold.