Introduction

Over the last decades, poverty has remained a persistent problem in developing countries. In March 2018, the Central Bureau of Statistics reported that the number of poor people in Indonesia reached 25.95 million (9.82 percent). This number decreased by around 633.2 thousand people from September 2017, when it stood at 26.58 million people (10.12 percent)1. The Central Bureau of Statistics obtains these data through the national socio-economic survey, commonly abbreviated as SUSENAS. It is a household-based survey that collects information on socio-economic characteristics such as education, health, family planning, travel, crime, housing, social protection, and household consumption and expenditure2. A SUSENAS round takes a long time to complete, there is a gap between one survey and the next, and the survey is also very expensive to run. Hence, estimating poverty well enough to target government programs is not an easy task. On the other hand, the digital revolution continues to generate abundant data, which provides new opportunities to capture information about socio-economic conditions at various levels of abstraction and to summarize development progress. These data can be used to monitor changes in prosperity levels as well as to measure the impact of government programs. One prospective data source for capturing socio-economic conditions is e-commerce data. The e-commerce market in Indonesia is one of the largest in Southeast Asia, contributing up to fifty percent of all transactions in the region. The growing population of internet users can increase e-commerce penetration in Indonesia, so its contribution to the economy has the potential to keep rising. Even without taking the B2B service sector into account, the gross merchandise value of the e-commerce market in Indonesia is projected to grow around eight times by 20223.

In recent years, several data sources have been explored for poverty estimation, such as satellite imagery and call detail records (CDRs)4,5,6,7. However, these datasets rely on assumptions, for example that the light intensity in nighttime satellite imagery reflects high economic activity in a particular area, or that high mobile phone credit in CDR data is related to welfare. In contrast, e-commerce data can reflect real expenditure on necessities at the household level without such assumptions8. This dataset therefore aligns more closely with the formal calculation of the poverty level. Nevertheless, the use of e-commerce data for poverty prediction is still relatively new and rarely studied. Furthermore, one challenge of using e-commerce data is its high dimensionality, which must be reduced to improve the performance of machine learning algorithms. Our hypothesis is that a feature subset produced by a feature selection algorithm can improve the performance of a machine learning algorithm.

According to this background, we propose a solution that complements surveys and censuses by using e-commerce data and machine learning algorithms, especially in Indonesia. The proposed method can be used as a fast and low-cost way to predict the poverty level, and governments and policymakers can use it as a baseline for determining development policies. The e-commerce dataset contains counts of purchases from a particular area, so it can indicate whether that area is prosperous or not. Thus, the contribution of this research is to propose a combination of statistical-based feature selection and machine learning algorithms to build a model for predicting poverty in Indonesia, using a dataset that represents Indonesian people’s needs based on e-commerce data. In this study, we have the following motivations:

  1.

    This study used e-commerce data from one of the largest e-commerce companies in Indonesia. Such data can be updated rapidly, complementing the National Socio-economic Survey that records poverty every 5 years. Real-time poverty prediction can help governments and policymakers determine the priority of development plans.

  2.

    Existing studies have utilized several methods to predict poverty, such as phone records9, satellite imagery10,11, and small area estimation12. However, these studies rely on several assumptions. In contrast, we use an e-commerce dataset obtained from one of the e-commerce platforms in Indonesia, and we argue that e-commerce data can represent the economic conditions in a particular area.

Hence, to the best of our knowledge, poverty estimation using e-commerce data and machine learning algorithms is still relatively new and rarely studied. In our previous work, we performed poverty prediction using machine learning algorithms, but only with a single feature selection algorithm, which makes that study quite limited8,13. A wrapper-based feature selection algorithm was also utilized, but it could not provide satisfactory results14. Hence, to emphasize originality, in this paper we use three statistical-based feature selection algorithms and three machine learning algorithms to find the best model for poverty estimation. This approach can be used not only for Indonesian data but can also potentially be adopted for e-commerce data from other countries.

Methods

The dataset used in this research is sample advertising data from one of the e-commerce companies in Indonesia, which was regenerated and its values altered. It covers eight item categories from 2016: motorbikes, cars, apartments for sale, apartments for rent, houses for sale, houses for rent, land for sale, and land for rent. Measuring the level of poverty requires a poverty limit/line, which reflects the rupiah value of the minimum monthly expenditure a person needs to fulfill basic food and non-food needs. The items above are included in the list of basic needs commodities15. The advertisements used in this study come from Java Island, the island that contributes the most postings, with a total of 18,881,913 advertisements across 118 cities/districts. Table 1 shows the details of the dataset used in this study.

Table 1 Dataset description.

From each item in Table 1, four aspects were aggregated per city, namely the number of items sold, selling price, number of buyers, and number of viewers, each summarized with three statistics (average, sum, and standard deviation). The resulting e-commerce dataset contains 96 numeric features and the actual poverty rate as a continuous label. The ground truth poverty rate is the official poverty rate issued by BPS (Statistics Indonesia) for the same year. Thus, we have 8 items × 4 aspects × 3 statistics = 96 features, and the dataset contains 96 columns (features) × 118 rows (cities). The extraction of items and aspects from the dataset is illustrated in Fig. 1. Because the data dimension is relatively large, we used statistical-based feature selection algorithms to select the most relevant features and machine learning algorithms to train and build prediction models on the selected features. The prediction process using machine learning algorithms and statistical-based feature selection generally has five stages: pre-processing, normalization, feature selection, model training, and evaluation.
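To make the feature construction concrete, the following is a minimal sketch of the 8 items × 4 aspects × 3 statistics aggregation using pandas. The raw table and its column names (city, item, items_sold, price, buyers, viewers) are hypothetical placeholders, not the actual schema of the e-commerce platform.

```python
# Minimal sketch of the 8 items x 4 aspects x 3 statistics aggregation,
# assuming a hypothetical raw advertisement table with illustrative columns.
import pandas as pd

ads = pd.DataFrame({
    "city":        ["Bandung", "Bandung", "Surabaya"],
    "item":        ["car", "car", "house_sale"],
    "items_sold":  [1, 2, 1],
    "price":       [150_000_000, 90_000_000, 450_000_000],
    "buyers":      [1, 1, 2],
    "viewers":     [120, 85, 310],
})

aspects = ["items_sold", "price", "buyers", "viewers"]

# Aggregate every aspect per (city, item) with average, sum, and standard deviation.
agg = ads.groupby(["city", "item"])[aspects].agg(["mean", "sum", "std"])

# Pivot items into columns so each city becomes one row of
# items x aspects x statistics features (96 in the full dataset).
features = agg.unstack("item")
features.columns = ["_".join(col) for col in features.columns]
print(features.shape)  # (number of cities, number of item_aspect_statistic columns)
```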

Figure 1
figure 1

Illustration of items and aspects extraction from dataset.

Figure 2 shows the flow of the proposed method. It starts by extracting items and aspects from the e-commerce data and computing the aggregated statistics. The extracted data still contains dirty records, which need to be cleaned up. The clean data are then normalized so that all features share the same scale, namely [0, 1]. In this work, the min–max method is used for normalization. It is a normalization technique that standardizes the dataset using a linear transformation16, mapping the data into a fixed range and thereby constraining a wide data range within a specific interval. It transforms a value Xo to Xp that fits in the specified range, as given by Eq. (1),

$${X}_{p}=\frac{{X}_{o}-{\text{min}}(x)}{{\text{max}}\left(x\right)-{\text{min}}(x)}$$
(1)

where Xp is the new value of variable X, Xo is the current value of variable X, and min(x) and max(x) are the minimum and maximum data points in the dataset, respectively. We used min–max normalization because it tends to yield fewer misclassification errors and has been reported to perform satisfactorily in both supervised and unsupervised learning17,18,19.
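As a minimal illustration of Eq. (1), the sketch below normalizes each column of a feature matrix into [0, 1]; the toy matrix is only for demonstration, and the same result can be obtained with scikit-learn's MinMaxScaler.

```python
# Minimal sketch of Eq. (1): min-max normalization to the [0, 1] range.
import numpy as np

def min_max_normalize(X: np.ndarray) -> np.ndarray:
    """Scale each column of X into [0, 1] using Eq. (1)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Avoid division by zero for constant columns.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span

X = np.array([[10.0, 200.0],
              [30.0, 400.0],
              [20.0, 300.0]])
print(min_max_normalize(X))
```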

Figure 2
figure 2

Flow of the proposed method.

Feature selection algorithms

Many datasets, such as those from marketplaces, healthcare, and social media, have a high dimension. High-dimensional data cause problems for algorithms designed for low-dimensional spaces and can also increase memory usage. To deal with high-dimensional data, this paper uses several filter-based feature selection algorithms: f-score, chi-square, and correlation-based feature selection (CFS). Filter methods do not rely on any learning algorithm; they rely on data characteristics to assess the importance of features and are usually more computationally efficient than other methods20.

The f-score is a simple technique for measuring the discrimination of two sets of real numbers21. Given training vectors xk, k = 1, 2,…, m, with n+ positive and n− negative instances, the f-score of the ith feature is defined as Eq. (2),

$${F}_{i}= \frac{{\left({\overline{x}}_{i}^{\left(+\right)}- {\overline{x}}_{i}\right)}^{2}+ {\left({\overline{x}}_{i}^{\left(-\right)}- {\overline{x}}_{i}\right)}^{2}}{\frac{1}{{n}_{+}-1}{\sum }_{k=1}^{{n}_{+}}{\left({x}_{k,i}^{\left(+\right)}-{\overline{x}}_{i}^{\left(+\right)}\right)}^{2}+ \frac{1}{{n}_{-}-1}{\sum }_{k=1}^{{n}_{-}}{\left({x}_{k,i}^{\left(-\right)}-{\overline{x}}_{i}^{\left(-\right)}\right)}^{2}}$$
(2)

where \({\overline{x} }_{i}\), \({\overline{x} }_{i}^{\left(+\right)}\), and \({\overline{x} }_{i}^{\left(-\right)}\) are the averages of the ith feature over the whole, positive, and negative datasets, respectively. The numerator indicates the discrimination between the positive and negative sets, and the denominator indicates the discrimination within each of the two sets. A greater f-score value indicates a more discriminative feature.
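The f-score of Eq. (2) can be computed directly from the class-wise means and sample variances. The sketch below assumes a binary label; the variable names and toy data are illustrative only.

```python
# Minimal sketch of the f-score in Eq. (2) for one feature and a binary label.
import numpy as np

def f_score(feature: np.ndarray, label: np.ndarray) -> float:
    """Discrimination of a single feature between positive and negative samples."""
    pos, neg = feature[label == 1], feature[label == 0]
    x_bar, x_pos, x_neg = feature.mean(), pos.mean(), neg.mean()
    numerator = (x_pos - x_bar) ** 2 + (x_neg - x_bar) ** 2
    # ddof=1 gives the sample variance, i.e. the (1/(n-1)) * sum terms of Eq. (2).
    denominator = pos.var(ddof=1) + neg.var(ddof=1)
    return numerator / denominator

# Toy example: a feature that separates the two classes fairly well.
x = np.array([0.1, 0.2, 0.15, 0.8, 0.9, 0.85])
y = np.array([0, 0, 0, 1, 1, 1])
print(f_score(x, y))
```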

Chi-square uses a test of independence to assess whether a feature is independent of the class label22. The chi-square criterion is defined in Eq. (3),

$${x}^{2}= \sum_{j=1}^{2}\sum_{s=1}^{c}\frac{{({n}_{js}- {\mu }_{js})}^{2}}{{\mu }_{js}}$$
(3)

where c is the number of classes, \({n}_{js}\) is the number of patterns in the jth interval and sth class, \({R}_{j}={\sum }_{s=1}^{c}{n}_{js}\) is the number of patterns in the jth interval, \({K}_{s}={\sum }_{j=1}^{2}{n}_{js}\) is the number of patterns in the sth class, \(N={\sum }_{j=1}^{2}{R}_{j}\) is the total number of patterns, and \({\mu }_{js}={R}_{j} \times \frac{{K}_{s}}{N}\) is the expected frequency of \({n}_{js}\).

If either \({R}_{j}\) or \({K}_{s}\) is 0, \({\mu }_{js}\) is set to 0.1. A higher chi-square value indicates that the feature is relatively more important. However, the chi-square algorithm needs discrete values to perform feature selection; hence, in this experiment the continuous poverty rate is rounded to obtain discrete values. The basic idea of the CFS algorithm, in turn, is to use a correlation-based heuristic to evaluate the worth of a feature subset23. The CFS criterion is defined in Eq. (4),

$${merit}_{s}= \frac{k\,\overline{{r}_{cf}}}{\sqrt{k+k\left(k-1\right)\overline{{r}_{ff}}}}$$
(4)

where the score is the heuristic “merit” of a feature subset s containing k features, \(\overline{{r }_{cf}}\) is the mean feature–class correlation over \(f \in s\), and \(\overline{{r}_{ff}}\) is the average feature–feature intercorrelation. The idea is that a better feature subset is more strongly correlated with the class label while its features are weakly intercorrelated with each other.
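As a rough illustration of how these two criteria could be computed, the sketch below ranks features with scikit-learn's chi2 implementation on a rounded (discretized) label and evaluates the CFS merit of Eq. (4) using Pearson correlations. The data are random placeholders, and the helper function is our own simplified reading of Eq. (4), not the implementation used in this study.

```python
# Minimal sketch: chi-square ranking (Eq. 3) with a rounded label, and the CFS
# merit of a candidate subset (Eq. 4). Feature matrix X and labels are toy data.
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
X = rng.random((118, 96))                 # normalized features in [0, 1], as in this study
poverty_rate = rng.random(118) * 20
y_discrete = np.round(poverty_rate).astype(int)   # chi-square needs discrete classes

chi2_scores, _ = chi2(X, y_discrete)
chi2_ranking = np.argsort(chi2_scores)[::-1]      # higher score = more relevant

def cfs_merit(X, y, subset):
    """Heuristic merit of a feature subset (Eq. 4), using Pearson correlations."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

print(chi2_ranking[:10])
print(cfs_merit(X, poverty_rate, list(chi2_ranking[:5])))
```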

Machine learning algorithms

In this research we mainly used the support vector regression (SVR) algorithm, because SVM classifiers have already shown good results in medical diagnostics, optical character recognition, electric load forecasting, and other fields24, and SVR is the most common way to build a regression model with SVMs25,26. The advantages of SVMs are a unique solution, low sensitivity to small changes of parameters, and good performance27. In addition, SVM implements structural risk minimization to obtain good generalization from a limited number of learning patterns28. The data produced by the previous stages are used for training. Consider a training dataset {(\(\overrightarrow{{x}_{1}},{z}_{1}\)), …, (\(\overrightarrow{{x}_{l}},{z}_{l}\))}, where \(\overrightarrow{{x}_{i}}\) and \({z}_{i}\) are a feature vector and its target output, respectively. The standard criteria of SVR are given in Eqs. (5) through (9)29.

$$\underset{w,b,\xi ,{\xi }^{*}}{{\text{min}}}\mathit{ }\frac{1}{2}{w}^{t}w+C\sum_{i=1}^{l}{\xi }_{i}+ C\sum_{i=1}^{l}{\xi }_{i}^{*}$$
(5)
$$\begin{aligned} {\text{subject to}}\,\,\, & w^{t} \emptyset \left( { x_{i} } \right) + b - z_{i} \le \varepsilon + \xi_{i} , \\ & z_{i} - w^{t} \emptyset \left( {x_{i} } \right) - b \le \varepsilon + \xi_{i}^{*} , \\ & \xi_{i} , \xi_{i}^{*} \ge 0,\quad i = 1, \ldots ,l. \\ \end{aligned}$$
(6)

where \(w\), \(C\), \(\xi\), \(\varepsilon\), and \(b\) denote the weight vector, the regularization parameter, the slack variables for the soft margin, the tolerance margin, and the intercept/bias, respectively. The dual problem is

$$\underset{\alpha ,{\alpha }^{*}}{{\text{min}}}\,\frac{1}{2}{\left(\alpha - {\alpha }^{*}\right)}^{t}Q\left(\alpha - {\alpha }^{*}\right)+\varepsilon \sum_{i=1}^{l}\left({\alpha }_{i} + {\alpha }_{i}^{*}\right) +\sum_{i=1}^{l}{z}_{i}\left({\alpha }_{i} - {\alpha }_{i}^{*}\right)$$
(7)
$$\begin{aligned} {\text{subject to}}\,\,\,\, & e^{t} \left( {\alpha {-} \alpha^{*} } \right) = 0, \\ & 0 \le \alpha_{i} ,\alpha_{i}^{*} \le C,\quad i = 1, \ldots , l, \\ \end{aligned}$$
(8)

where \(\alpha\) and \({\alpha }^{*}\) denote the Lagrange multipliers and \({Q}_{ij}=K\left({x}_{i},{x}_{j}\right)\equiv \varnothing {\left({x}_{i}\right)}^{t}\varnothing \left({x}_{j}\right)\). The approximate function after solving the dual problem in Eqs. (7) and (8) is

$$\sum_{i=1}^{l}\left({-\alpha }_{i}+ {\alpha }_{i}^{*}\right)K\left({x}_{i}, x\right)+b$$
(9)

The output from the model is \({\alpha }^{*}- \alpha\).

To ensure that the model has good parameter values, a grid search was performed over the kernel (RBF and polynomial), the epsilon value within [0.1, 0.5, 1.0, 1.5, 2.0], the parameter C within [1, 10, 100, 1000], and gamma within [0.001, 0.0001]. LIBSVM was used as the SVR library. From the grid search, we selected the RBF kernel with epsilon = 0.5, C = 10, and gamma = 0.001.
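A minimal sketch of this grid search, using scikit-learn's SVR (which wraps LIBSVM) rather than calling LIBSVM directly, is shown below; the feature matrix and labels are random placeholders, so the selected parameters will differ from those reported above.

```python
# Minimal sketch of the grid search described above, using scikit-learn's SVR.
# The search grid mirrors the values in the text; X and y are placeholders for
# the normalized feature matrix and poverty rates.
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((118, 96))
y = rng.random(118) * 20

param_grid = {
    "kernel":  ["rbf", "poly"],
    "epsilon": [0.1, 0.5, 1.0, 1.5, 2.0],
    "C":       [1, 10, 100, 1000],
    "gamma":   [0.001, 0.0001],
}

search = GridSearchCV(SVR(), param_grid,
                      scoring="neg_root_mean_squared_error",
                      cv=LeaveOneOut())
search.fit(X, y)
print(search.best_params_)  # the paper reports RBF, epsilon=0.5, C=10, gamma=0.001
```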

We also used k-nearest neighbor regression (k-NN) and linear regression (LR) as baselines for SVR. The k-NN algorithm classifies objects based on the closest training examples in feature space30; it is a type of lazy learning in which the function is only approximated locally. The same method can be used for regression by assigning to the object the average of the values of its k nearest neighbors. k-NN is widely adopted for classification and regression because of its simplicity and intuitiveness31. LR, in turn, models the linear relationship between a dependent variable and one or more independent variables32.

Evaluation

For evaluation, the leave-one-out method was used for cross-validation. We used the root mean squared error (RMSE) and R-squared (R2) to measure the performance of the machine learning models. The RMSE measures the error between the actual and predicted vectors; the best predictions are obtained when the RMSE value is low, meaning the difference between actual and predicted data is small. Equation (10) shows the RMSE,

$$RMSE\left(y, \widehat{y}\right)= \sqrt{\frac{{\sum }_{i=1}^{L}{({y}_{i}- {\widehat{y}}_{i})}^{2}}{L}}$$
(10)

where \(y, \widehat{y}, L\) indicate the actual values, the predicted values, and the data length, respectively. In addition, we used R2, shown in Eq. (11), to quantify the share of the variance of the actual data explained by the model. R2 assesses whether the regression model correctly predicts the actual values. It typically ranges from 0 to 1; a value near or equal to 1 means the model predicts the actual data almost perfectly, whereas a value of 0 or below means the model does not follow the trend of the actual data33.

$${R}^{2}\left(y, \widehat{y}\right)= 1- \frac{\sum_{i=1}^{L}{({y}_{i}- {\widehat{y}}_{i})}^{2}}{\sum_{i=1}^{L}{({y}_{i}- {\overline{y} }_{i})}^{2}}$$
(11)
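The evaluation protocol can be sketched as follows, assuming the selected features and poverty rates are available as a matrix X and vector y (random placeholders here); the SVR hyperparameters are those reported in the previous section, while the k-NN and LR settings are illustrative defaults.

```python
# Minimal sketch of the evaluation protocol: leave-one-out predictions for
# SVR, k-NN regression, and linear regression, scored with RMSE (Eq. 10) and
# R^2 (Eq. 11). X and y are placeholders for the selected features and labels.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((118, 90))          # e.g. the top-90 features from the f-score ranking
y = rng.random(118) * 20

models = {
    "SVR":  SVR(kernel="rbf", epsilon=0.5, C=10, gamma=0.001),
    "k-NN": KNeighborsRegressor(n_neighbors=5),
    "LR":   LinearRegression(),
}

for name, model in models.items():
    y_hat = cross_val_predict(model, X, y, cv=LeaveOneOut())
    rmse = np.sqrt(mean_squared_error(y, y_hat))
    r2 = r2_score(y, y_hat)
    print(f"{name}: RMSE={rmse:.4f}, R2={r2:.4f}")
```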

Results and discussion

F-score, chi-square, and CFS feature selection were used to select the most relevant features by ranking all the features in the dataset; the result of this stage is a ranking of feature indexes. Tables 2 and 3 show the feature selection results for f-score and chi-square, respectively. We also found that CFS produced an inconsistent index ranking in every experiment; these experimental results are shown in Table 4.

Table 2 F-score feature ranking.
Table 3 Chi-square feature ranking.
Table 4 CFS feature ranking experiment.

According to the experiments, the first six feature indexes are consistent across rankings, while the others show inconsistency. Thus, we decided that the results of the CFS algorithm could not be used for building a machine learning model, and we only used the f-score and chi-square feature selections. After the features were ranked, we searched for the best result by running prediction experiments with increasing numbers of features, starting from 10 features, then 20, 30, 40, and so on up to 96 features. Every experiment was evaluated using R2 and RMSE. We also used LR and k-NN besides SVR to compare them and show that SVR is the best method. Tables 5 and 6 show SVR prediction results without feature selection and with feature selection, respectively.

Table 5 Prediction experiments without feature selection.
Table 6 Prediction experiments with feature selection.

Tables 5 and 6 show the results of our experiments, broken down by the number of features, the machine learning algorithm, and the feature selection algorithm; bold text indicates the best results. Table 5 compares the machine learning algorithms without feature selection, where the best R2 score, 0.42321, is obtained by SVR. Table 6 shows the experiments with feature selection, where the best R2 value, 0.42765, is obtained by SVR using the top 90 features of the f-score ranking in Table 2. According to these experimental results, a feature selection algorithm can improve the performance of machine learning on the high-dimensional e-commerce dataset.
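A simplified sketch of this feature-count sweep is given below. It uses scikit-learn's f_regression as a stand-in for the paper's f-score ranking and random placeholder data, so the numbers it prints are not the reported results.

```python
# Minimal sketch of the incremental experiments: rank features, then evaluate
# SVR on the top 10, 20, ..., 96 features with leave-one-out R^2.
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((118, 96))          # placeholder feature matrix
y = rng.random(118) * 20           # placeholder poverty rates

scores, _ = f_regression(X, y)     # stand-in ranking criterion
ranking = np.argsort(scores)[::-1]

svr = SVR(kernel="rbf", epsilon=0.5, C=10, gamma=0.001)
results = {}
for k in [10, 20, 30, 40, 50, 60, 70, 80, 90, 96]:
    y_hat = cross_val_predict(svr, X[:, ranking[:k]], y, cv=LeaveOneOut())
    results[k] = r2_score(y, y_hat)

best_k = max(results, key=results.get)
print(best_k, results[best_k])     # best number of features and its R^2
```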

The best models of SVR, k-NN regression, and LR are visualized in Fig. 3a–c, respectively. The blue line is the trend of the actual data (y = x) and the yellow lines are the error thresholds (± 1.5); points between the error margins have smaller prediction errors and vice versa.

Figure 3
figure 3

Data visualization of (a) SVR, (b) k-NN regression, (c) linear regression.

The SVR visualization shows more points with small errors than the other models, which makes SVR the best of the models compared in this paper. The worst visualization is obtained with linear regression, where some points lie far from the actual data, meaning that LR produces the worst model.

Furthermore, the results are also visualized as choropleth maps in Figs. 4 and 5. Figure 4 shows the actual poverty rate mapping for Java Island; the maps were generated using Leaflet 1.6.034, an open-source JavaScript library for building web mapping applications. A darker color indicates a higher poverty rate and vice versa. The predicted poverty mapping is displayed in Fig. 5. Compared with the actual data in Fig. 4, the predictions in Fig. 5 show lower poverty rates, indicating that the prediction model tends to underestimate. Finally, Table 7 presents a detailed comparison between the actual and predicted poverty rates at the city level, providing a breakdown of each poverty percentage value for a comprehensive analysis.

Figure 4
figure 4

Actual poverty rate mapping in each city in Java island.

Figure 5
figure 5

Predicted poverty mapping based on cities on Java island.

Table 7 Comparison of actual and predicted poverty rates.

Conclusion

E-commerce data has the potential to predict poverty. Hence, we used machine learning algorithms to model the e-commerce data: three feature selection algorithms were applied to select the best features, and support vector regression was then used to predict the poverty rate. The experimental results show that using all features does not guarantee good performance. Among the three statistical-based feature selection algorithms evaluated with RMSE and R2, the f-score gives the best result, producing the highest R2 and the lowest RMSE. This indicates that a feature selection algorithm can improve the performance of a machine learning algorithm for poverty prediction. We also found that CFS feature selection produces an unstable feature ranking. The main weakness of the proposed method is that it still has difficulty predicting regions with higher poverty rates, while its main advantage is a smaller error than the other algorithms; the performance gap between the SVR model and the other machine learning models, e.g. k-NN and LR, is quite large. Overall, the results show the potential of combining e-commerce data, feature selection algorithms, and machine learning algorithms for poverty estimation, and governments, policymakers, and researchers can consider e-commerce datasets as a proxy for socio-economic conditions. The study is limited by using only one year of data; more data are needed to improve the model's performance, and additional data might produce a better model, especially for the underestimated results. For future research, larger datasets should be utilized to obtain a more accurate poverty model. However, the major limitation of e-commerce data remains its accessibility and confidentiality, which make it difficult to obtain.

Acronyms

Acronyms used in this paper can be seen in Table 8.

Table 8 Acronym list.