Introduction

Over the last decades, poverty has remained a persistent problem in developing countries. In March 2018, the Central Bureau of Statistics reported that the number of poor people in Indonesia reached 25.95 million (9.82 percent). This number decreased by around 633.2 thousand people from September 2017, when it stood at 26.58 million people (10.12 percent)1. The Central Bureau of Statistics obtains these data through the national socio-economic survey, commonly abbreviated as SUSENAS. It is a household-based survey that collects information on socio-economic characteristics such as education, health, family planning, travel, crime, housing, social protection, and household consumption and expenditure2. A SUSENAS round takes a long time to complete, there is a gap between one survey and the next, and the survey is also very expensive to run. Hence, estimating poverty well enough to target government programs is not an easy task. On the other hand, the digital revolution continues to generate abundant data, which provides new opportunities to capture information about socio-economic conditions at various levels of abstraction and to summarize development progress. These data can be used to monitor changes in prosperity levels as well as to measure the impact of government programs. One prospective data source for capturing socio-economic conditions is e-commerce data. The e-commerce market in Indonesia is one of the largest in Southeast Asia, contributing up to fifty percent of all transactions in the region. The growing population of internet users can increase e-commerce penetration in Indonesia, so its contribution to the economy has the potential to keep rising. Even without taking the B2B service sector into account, the gross merchandise value of the e-commerce market in Indonesia is projected to grow around eight times by 20223.

In recent years, several data sources have been explored for poverty estimation, such as satellite imagery and call detail records (CDRs)4,5,6,7. However, these datasets rely on assumptions, for example that the light intensity in nighttime satellite imagery reflects high economic activity in a particular area, or that high mobile phone credit in CDR data is related to welfare. In contrast, e-commerce data can reflect real expenditure on necessities at the household level without such assumptions8. This dataset therefore aligns more closely with the formal calculation of the poverty level. Nevertheless, the use of e-commerce data for poverty prediction is still relatively new and rarely studied. Furthermore, one challenge of using e-commerce data is its high dimensionality, which must be reduced to improve the performance of machine learning algorithms. Our hypothesis is that a feature subset produced by a feature selection algorithm can improve the performance of a machine learning algorithm.

According to this background, we propose a solution that complements surveys and censuses by using e-commerce data and machine learning algorithms, especially in Indonesia. The proposed method can be used as a fast and low-cost way to predict the poverty level, and governments and policymakers can use it as a baseline for determining development policies. The e-commerce dataset contains counts of purchases from a particular area, so it can indicate whether that area is prosperous or not. Thus, the contribution of this research is to propose a combination of statistical-based feature selection and machine learning algorithms to build a model for predicting poverty in Indonesia, using a dataset that represents Indonesian people’s needs based on e-commerce data. In this study, we have the following motivations:

  1.

    This study used e-commerce data from one of the largest e-commerce companies in Indonesia. Such data can be updated rapidly, complementing the National Socio-economic Survey that records poverty every 5 years. Real-time poverty prediction can help governments and policymakers determine the priority of development plans.

  2.

    Existing studies have utilized several methods to predict poverty, such as phone records9, satellite imagery10,11, and small area estimation12. However, these studies rely on several assumptions. In contrast, we use an e-commerce dataset obtained from one of the e-commerce platforms in Indonesia, and we argue that e-commerce data can represent the economic conditions in a particular area.

Hence, to the best of our knowledge, poverty estimation using e-commerce data and machine learning algorithms is still relatively new and rarely studied. In our previous work, we performed poverty prediction using machine learning algorithms, but only with a single feature selection algorithm, which makes that study quite limited8,13. A wrapper-based feature selection algorithm was also utilized, but it could not provide satisfactory results14. Hence, to emphasize originality, in this paper we use three statistical-based feature selection algorithms and three machine learning algorithms to find the best model for poverty estimation. This approach can be used not only for Indonesian data but can also potentially be adopted for e-commerce data from other countries.

Methods

The dataset used in this research is sample advertising data from one of the e-commerce companies in Indonesia, which was regenerated and its values altered. It covers eight item categories from 2016: motorbikes, cars, apartments for sale, apartments for rent, houses for sale, houses for rent, land for sale, and land for rent. Measuring the level of poverty requires a poverty limit/line, which reflects the rupiah value of the minimum monthly expenditure a person needs to fulfill basic food and non-food needs. The items above are included in the list of basic needs commodities15. The advertisements used in this study come from Java Island, the island that contributes the most postings, with a total of 18,881,913 advertisements across 118 cities/districts. Table 1 shows the details of the dataset used in this study.

Table 1 Dataset description.

From each item in Table 1, four aspects were aggregated per city, namely the number of items sold, selling price, number of buyers, and number of viewers, each summarized with three statistics (average, sum, and standard deviation). The resulting e-commerce dataset contains 96 numeric features and the actual poverty rate as a continuous label. The ground truth poverty rate is the official poverty rate issued by BPS (Statistics Indonesia) for the same year. Thus, we have 8 items × 4 aspects × 3 statistics = 96 features, and the dataset contains 96 columns (features) × 118 rows (cities). The extraction of items and aspects from the dataset is illustrated in Fig. 1. Because the data dimension is relatively large, we used statistical-based feature selection algorithms to select the most relevant features and machine learning algorithms to train and build prediction models on the selected features. The prediction process using machine learning algorithms and statistical-based feature selection generally has five stages: pre-processing, normalization, feature selection, model training, and evaluation.
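To make the feature construction concrete, the following is a minimal sketch of the 8 items × 4 aspects × 3 statistics aggregation using pandas. The raw table and its column names (city, item, items_sold, price, buyers, viewers) are hypothetical placeholders, not the actual schema of the e-commerce platform.

```python
# Minimal sketch of the 8 items x 4 aspects x 3 statistics aggregation,
# assuming a hypothetical raw advertisement table with illustrative columns.
import pandas as pd

ads = pd.DataFrame({
    "city":        ["Bandung", "Bandung", "Surabaya"],
    "item":        ["car", "car", "house_sale"],
    "items_sold":  [1, 2, 1],
    "price":       [150_000_000, 90_000_000, 450_000_000],
    "buyers":      [1, 1, 2],
    "viewers":     [120, 85, 310],
})

aspects = ["items_sold", "price", "buyers", "viewers"]

# Aggregate every aspect per (city, item) with average, sum, and standard deviation.
agg = ads.groupby(["city", "item"])[aspects].agg(["mean", "sum", "std"])

# Pivot items into columns so each city becomes one row of
# items x aspects x statistics features (96 in the full dataset).
features = agg.unstack("item")
features.columns = ["_".join(col) for col in features.columns]
print(features.shape)  # (number of cities, number of item_aspect_statistic columns)
```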

Figure 1
figure 1

Illustration of items and aspects extraction from dataset.

Figure 2 shows the flow of the proposed method. It starts by extracting items and aspects from the e-commerce data and computing the aggregated statistics. The extracted data still contains dirty records, which need to be cleaned up. The clean data are then normalized so that all features share the same scale, namely [0, 1]. In this work, the min–max method is used for normalization. It is a normalization technique that standardizes the dataset using a linear transformation16, mapping the data into a fixed range and thereby constraining a wide data range within a specific interval. It transforms a value Xo to Xp that fits in the specified range, as given by Eq. (1),

$${X}_{p}=\frac{{X}_{o}-{\text{min}}(x)}{{\text{max}}\left(x\right)-{\text{min}}(x)}$$
(1)

where Xp is the new value of variable X, Xo is the current value of variable X, and min(x) and max(x) are the minimum and maximum data points in the dataset, respectively. We used min–max normalization because it tends to yield fewer misclassification errors and has been reported to perform satisfactorily in both supervised and unsupervised learning17,18,19.
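As a minimal illustration of Eq. (1), the sketch below normalizes each column of a feature matrix into [0, 1]; the toy matrix is only for demonstration, and the same result can be obtained with scikit-learn's MinMaxScaler.

```python
# Minimal sketch of Eq. (1): min-max normalization to the [0, 1] range.
import numpy as np

def min_max_normalize(X: np.ndarray) -> np.ndarray:
    """Scale each column of X into [0, 1] using Eq. (1)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Avoid division by zero for constant columns.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span

X = np.array([[10.0, 200.0],
              [30.0, 400.0],
              [20.0, 300.0]])
print(min_max_normalize(X))
```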

Figure 2
figure 2

Flow of the proposed method.

Feature selection algorithms

Many datasets, such as those from marketplaces, healthcare, and social media, have a high dimension. High-dimensional data cause problems for algorithms designed for low-dimensional spaces and can also increase memory usage. To deal with high-dimensional data, this paper uses several filter-based feature selection algorithms: f-score, chi-square, and correlation-based feature selection (CFS). Filter methods do not rely on any learning algorithm; they rely on data characteristics to assess the importance of features and are usually more computationally efficient than other methods20.

The f-score is a simple technique for measuring the discrimination of two sets of real numbers21. Given training vectors xk, k = 1, 2,…, m, with n+ positive and n− negative instances, the f-score of the ith feature is defined as Eq. (2),

$${F}_{i}= \frac{{\left({\overline{x}}_{i}^{\left(+\right)}- {\overline{x}}_{i}\right)}^{2}+ {\left({\overline{x}}_{i}^{\left(-\right)}- {\overline{x}}_{i}\right)}^{2}}{\frac{1}{{n}_{+}-1}{\sum }_{k=1}^{{n}_{+}}{\left({x}_{k,i}^{\left(+\right)}-{\overline{x}}_{i}^{\left(+\right)}\right)}^{2}+ \frac{1}{{n}_{-}-1}{\sum }_{k=1}^{{n}_{-}}{\left({x}_{k,i}^{\left(-\right)}-{\overline{x}}_{i}^{\left(-\right)}\right)}^{2}}$$
(2)

where \({\overline{x} }_{i}\), \({\overline{x} }_{i}^{\left(+\right)}\), and \({\overline{x} }_{i}^{\left(-\right)}\) are the averages of the ith feature over the whole, positive, and negative datasets, respectively. The numerator indicates the discrimination between the positive and negative sets, and the denominator indicates the discrimination within each of the two sets. A greater f-score value indicates a more discriminative feature.
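The f-score of Eq. (2) can be computed directly from the class-wise means and sample variances. The sketch below assumes a binary label; the variable names and toy data are illustrative only.

```python
# Minimal sketch of the f-score in Eq. (2) for one feature and a binary label.
import numpy as np

def f_score(feature: np.ndarray, label: np.ndarray) -> float:
    """Discrimination of a single feature between positive and negative samples."""
    pos, neg = feature[label == 1], feature[label == 0]
    x_bar, x_pos, x_neg = feature.mean(), pos.mean(), neg.mean()
    numerator = (x_pos - x_bar) ** 2 + (x_neg - x_bar) ** 2
    # ddof=1 gives the sample variance, i.e. the (1/(n-1)) * sum terms of Eq. (2).
    denominator = pos.var(ddof=1) + neg.var(ddof=1)
    return numerator / denominator

# Toy example: a feature that separates the two classes fairly well.
x = np.array([0.1, 0.2, 0.15, 0.8, 0.9, 0.85])
y = np.array([0, 0, 0, 1, 1, 1])
print(f_score(x, y))
```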

Chi-square uses a test of independence to assess whether a feature is independent of the class label22. The chi-square criterion is defined in Eq. (3),

$${x}^{2}= \sum_{j=1}^{2}\sum_{s=1}^{c}\frac{{({n}_{js}- {\mu }_{js})}^{2}}{{\mu }_{js}}$$
(3)

where c is the number of classes, \({n}_{js}\) is the number of patterns in the jth interval and sth class, \({R}_{j}={\sum }_{s=1}^{c}{n}_{js}\) is the number of patterns in the jth interval, \({K}_{s}={\sum }_{j=1}^{2}{n}_{js}\) is the number of patterns in the sth class, \(N={\sum }_{j=1}^{2}{R}_{j}\) is the total number of patterns, and \({\mu }_{js}={R}_{j} \times \frac{{K}_{s}}{N}\) is the expected frequency of \({n}_{js}\).

If either \({R}_{j}\) or \({K}_{s}\) is 0, \({\mu }_{js}\) is set to 0.1. A higher chi-square value indicates that the feature is relatively more important. However, the chi-square algorithm needs discrete values to perform feature selection; hence, in this experiment the continuous poverty rate is rounded to obtain discrete values. The basic idea of the CFS algorithm, in turn, is to use a correlation-based heuristic to evaluate the worth of a feature subset23. The CFS criterion is defined in Eq. (4),

$${merit}_{s}= \frac{k\,\overline{{r}_{cf}}}{\sqrt{k+k\left(k-1\right)\overline{{r}_{ff}}}}$$
(4)

where the score is the heuristic “merit” of a feature subset s containing k features, \(\overline{{r }_{cf}}\) is the mean feature–class correlation over \(f \in s\), and \(\overline{{r}_{ff}}\) is the average feature–feature intercorrelation. The idea is that a better feature subset is more strongly correlated with the class label while its features are weakly intercorrelated with each other.
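As a rough illustration of how these two criteria could be computed, the sketch below ranks features with scikit-learn's chi2 implementation on a rounded (discretized) label and evaluates the CFS merit of Eq. (4) using Pearson correlations. The data are random placeholders, and the helper function is our own simplified reading of Eq. (4), not the implementation used in this study.

```python
# Minimal sketch: chi-square ranking (Eq. 3) with a rounded label, and the CFS
# merit of a candidate subset (Eq. 4). Feature matrix X and labels are toy data.
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
X = rng.random((118, 96))                 # normalized features in [0, 1], as in this study
poverty_rate = rng.random(118) * 20
y_discrete = np.round(poverty_rate).astype(int)   # chi-square needs discrete classes

chi2_scores, _ = chi2(X, y_discrete)
chi2_ranking = np.argsort(chi2_scores)[::-1]      # higher score = more relevant

def cfs_merit(X, y, subset):
    """Heuristic merit of a feature subset (Eq. 4), using Pearson correlations."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

print(chi2_ranking[:10])
print(cfs_merit(X, poverty_rate, list(chi2_ranking[:5])))
```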

Machine learning algorithms

In this research we mainly used the support vector regression (SVR) algorithm, because SVM classifiers have already shown good results in medical diagnostics, optical character recognition, electric load forecasting, and other fields24, and SVR is the most common way to build a regression model with SVMs25,26. The advantages of SVMs are a unique solution, low sensitivity to small changes of parameters, and good performance27. In addition, SVM implements structural risk minimization to obtain good generalization from a limited number of learning patterns28. The data produced by the previous stages are used for training. Consider a training dataset {(\(\overrightarrow{{x}_{1}},{z}_{1}\)), …, (\(\overrightarrow{{x}_{l}},{z}_{l}\))}, where \(\overrightarrow{{x}_{i}}\) and \({z}_{i}\) are a feature vector and its target output, respectively. The standard criteria of SVR are given in Eqs. (5) through (9)29.

$$\underset{w,b,\xi ,{\xi }^{*}}{{\text{min}}}\mathit{ }\frac{1}{2}{w}^{t}w+C\sum_{i=1}^{l}{\xi }_{i}+ C\sum_{i=1}^{l}{\xi }_{i}^{*}$$
(5)
$$\begin{aligned} {\text{subject to}}\,\,\, & w^{t} \emptyset \left( { x_{i} } \right) + b - z_{i} \le \varepsilon + \xi_{i} , \\ & z_{i} - w^{t} \emptyset \left( {x_{i} } \right) - b \le \varepsilon + \xi_{i}^{*} , \\ & \xi_{i} , \xi_{i}^{*} \ge 0,\quad i = 1, \ldots ,l. \\ \end{aligned}$$
(6)

where \(w\), \(C\), \(\xi\), \(\varepsilon\), and \(b\) denote the weight vector, the regularization parameter, the slack variables for the soft margin, the tolerance margin, and the intercept/bias, respectively. The dual problem is

$$\underset{\alpha ,{\alpha }^{*}}{{\text{min}}}\,\frac{1}{2}{\left(\alpha - {\alpha }^{*}\right)}^{t}Q\left(\alpha - {\alpha }^{*}\right)+\varepsilon \sum_{i=1}^{l}\left({\alpha }_{i} + {\alpha }_{i}^{*}\right) +\sum_{i=1}^{l}{z}_{i}\left({\alpha }_{i} - {\alpha }_{i}^{*}\right)$$
(7)
$$\begin{aligned} {\text{subject to}}\,\,\,\, & e^{t} \left( {\alpha {-} \alpha^{*} } \right) = 0, \\ & 0 \le \alpha_{i} ,\alpha_{i}^{*} \le C,\quad i = 1, \ldots , l, \\ \end{aligned}$$
(8)

where \(\alpha\) and \({\alpha }^{*}\) denote the Lagrange multipliers and \({Q}_{ij}=K\left({x}_{i},{x}_{j}\right)\equiv \varnothing {\left({x}_{i}\right)}^{t}\varnothing \left({x}_{j}\right)\). The approximate function after solving the dual problem in Eqs. (7) and (8) is

$$\sum_{i=1}^{l}\left({-\alpha }_{i}+ {\alpha }_{i}^{*}\right)K\left({x}_{i}, x\right)+b$$
(9)

The output from the model is \({\alpha }^{*}- \alpha\).

To ensure that the model has good parameter values, a grid search was performed over the kernel (RBF and polynomial), the epsilon value within [0.1, 0.5, 1.0, 1.5, 2.0], the parameter C within [1, 10, 100, 1000], and gamma within [0.001, 0.0001]. LIBSVM was used as the SVR library. From the grid search, we selected the RBF kernel with epsilon = 0.5, C = 10, and gamma = 0.001.
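A minimal sketch of this grid search, using scikit-learn's SVR (which wraps LIBSVM) rather than calling LIBSVM directly, is shown below; the feature matrix and labels are random placeholders, so the selected parameters will differ from those reported above.

```python
# Minimal sketch of the grid search described above, using scikit-learn's SVR.
# The search grid mirrors the values in the text; X and y are placeholders for
# the normalized feature matrix and poverty rates.
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((118, 96))
y = rng.random(118) * 20

param_grid = {
    "kernel":  ["rbf", "poly"],
    "epsilon": [0.1, 0.5, 1.0, 1.5, 2.0],
    "C":       [1, 10, 100, 1000],
    "gamma":   [0.001, 0.0001],
}

search = GridSearchCV(SVR(), param_grid,
                      scoring="neg_root_mean_squared_error",
                      cv=LeaveOneOut())
search.fit(X, y)
print(search.best_params_)  # the paper reports RBF, epsilon=0.5, C=10, gamma=0.001
```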

We also used k-nearest neighbor regression (k-NN) and linear regression (LR) as baselines for SVR. The k-NN algorithm classifies objects based on the closest training examples in feature space30; it is a type of lazy learning in which the function is only approximated locally. The same method can be used for regression by assigning to the object the average of the values of its k nearest neighbors. k-NN is widely adopted for classification and regression because of its simplicity and intuitiveness31. LR, in turn, models the linear relationship between a dependent variable and one or more independent variables32.

Evaluation

For evaluation, the leave-one-out method was used for cross-validation. We used the root mean squared error (RMSE) and R-squared (R2) to measure the performance of the machine learning models. The RMSE measures the error between the actual and predicted vectors; the best predictions are obtained when the RMSE value is low, meaning the difference between actual and predicted data is small. Equation (10) shows the RMSE,

$$RMSE\left(y, \widehat{y}\right)= \sqrt{\frac{{\sum }_{i=1}^{L}{({y}_{i}- {\widehat{y}}_{i})}^{2}}{L}}$$
(10)

where \(y, \widehat{y}, L\) indicate the actual values, the predicted values, and the data length, respectively. In addition, we used R2, shown in Eq. (11), to quantify the share of the variance of the actual data explained by the model. R2 assesses whether the regression model correctly predicts the actual values. It typically ranges from 0 to 1; a value near or equal to 1 means the model predicts the actual data almost perfectly, whereas a value of 0 or below means the model does not follow the trend of the actual data33.

$${R}^{2}\left(y, \widehat{y}\right)= 1- \frac{\sum_{i=1}^{L}{({y}_{i}- {\widehat{y}}_{i})}^{2}}{\sum_{i=1}^{L}{({y}_{i}- {\overline{y} }_{i})}^{2}}$$
(11)
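The evaluation protocol can be sketched as follows, assuming the selected features and poverty rates are available as a matrix X and vector y (random placeholders here); the SVR hyperparameters are those reported in the previous section, while the k-NN and LR settings are illustrative defaults.

```python
# Minimal sketch of the evaluation protocol: leave-one-out predictions for
# SVR, k-NN regression, and linear regression, scored with RMSE (Eq. 10) and
# R^2 (Eq. 11). X and y are placeholders for the selected features and labels.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((118, 90))          # e.g. the top-90 features from the f-score ranking
y = rng.random(118) * 20

models = {
    "SVR":  SVR(kernel="rbf", epsilon=0.5, C=10, gamma=0.001),
    "k-NN": KNeighborsRegressor(n_neighbors=5),
    "LR":   LinearRegression(),
}

for name, model in models.items():
    y_hat = cross_val_predict(model, X, y, cv=LeaveOneOut())
    rmse = np.sqrt(mean_squared_error(y, y_hat))
    r2 = r2_score(y, y_hat)
    print(f"{name}: RMSE={rmse:.4f}, R2={r2:.4f}")
```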

Results and discussion

F-score, chi-square, and CFS feature selection were used to select the most relevant features by ranking all the features in the dataset; the result of this stage is a ranking of feature indexes. Tables 2 and 3 show the feature selection results for f-score and chi-square, respectively. We also found that CFS produced an inconsistent index ranking in every experiment; these experimental results are shown in Table 4.

Table 2 F-score feature ranking.
Table 3 Chi-square feature ranking.
Table 4 CFS feature ranking experiment.

According to the experiments, the first six feature indexes are consistent across rankings, while the others show inconsistency. Thus, we decided that the results of the CFS algorithm could not be used for building a machine learning model, and we only used the f-score and chi-square feature selections. After the features were ranked, we searched for the best result by running prediction experiments with increasing numbers of features, starting from 10 features, then 20, 30, 40, and so on up to 96 features. Every experiment was evaluated using R2 and RMSE. We also used LR and k-NN besides SVR to compare them and show that SVR is the best method. Tables 5 and 6 show SVR prediction results without feature selection and with feature selection, respectively.

Table 5 Prediction experiments without feature selection.
Table 6 Prediction experiments with feature selection.

Tables 5 and 6 show the results of our experiments, broken down by the number of features, the machine learning algorithm, and the feature selection algorithm; bold text indicates the best results. Table 5 compares the machine learning algorithms without feature selection, where the best R2 score, 0.42321, is obtained by SVR. Table 6 shows the experiments with feature selection, where the best R2 value, 0.42765, is obtained by SVR using the top 90 features of the f-score ranking in Table 2. According to these experimental results, a feature selection algorithm can improve the performance of machine learning on the high-dimensional e-commerce dataset.
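A simplified sketch of this feature-count sweep is given below. It uses scikit-learn's f_regression as a stand-in for the paper's f-score ranking and random placeholder data, so the numbers it prints are not the reported results.

```python
# Minimal sketch of the incremental experiments: rank features, then evaluate
# SVR on the top 10, 20, ..., 96 features with leave-one-out R^2.
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((118, 96))          # placeholder feature matrix
y = rng.random(118) * 20           # placeholder poverty rates

scores, _ = f_regression(X, y)     # stand-in ranking criterion
ranking = np.argsort(scores)[::-1]

svr = SVR(kernel="rbf", epsilon=0.5, C=10, gamma=0.001)
results = {}
for k in [10, 20, 30, 40, 50, 60, 70, 80, 90, 96]:
    y_hat = cross_val_predict(svr, X[:, ranking[:k]], y, cv=LeaveOneOut())
    results[k] = r2_score(y, y_hat)

best_k = max(results, key=results.get)
print(best_k, results[best_k])     # best number of features and its R^2
```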

The best models of SVR, k-NN regression, and LR are visualized in Fig. 3a–c, respectively. The blue line is the trend of the actual data (y = x) and the yellow lines are the error thresholds (± 1.5); points between the error margins have smaller prediction errors and vice versa.

Figure 3
figure 3

Data visualization of (a) SVR, (b) k-NN regression, (c) linear regression.

The SVR visualization shows more points with small errors than the other models, which makes SVR the best of the models compared in this paper. The worst visualization is obtained with linear regression, where some points lie far from the actual data, meaning that LR produces the worst model.

Furthermore, the results are also visualized as choropleth maps in Figs. 4 and 5. Figure 4 shows the actual poverty rate mapping for Java Island; the maps were generated using Leaflet 1.6.034, an open-source JavaScript library for building web mapping applications. A darker color indicates a higher poverty rate and vice versa. The predicted poverty mapping is displayed in Fig. 5. Compared with the actual data in Fig. 4, the predictions in Fig. 5 show lower poverty rates, indicating that the prediction model tends to underestimate. Finally, Table 7 presents a detailed comparison between the actual and predicted poverty rates at the city level, providing a breakdown of each poverty percentage value for a comprehensive analysis.

Figure 4
figure 4

Actual poverty rate mapping in each city in Java island.

Figure 5
figure 5

Predicted poverty mapping based on cities on Java island.

Table 7 Comparison of actual and predicted poverty rates.

Conclusion

E-commerce data has the potential to predict poverty. Hence, we used machine learning algorithms to model the e-commerce data: three feature selection algorithms were applied to select the best features, and support vector regression was then used to predict the poverty rate. The experimental results show that using all features does not guarantee good performance. Among the three statistical-based feature selection algorithms evaluated with RMSE and R2, the f-score gives the best result, producing the highest R2 and the lowest RMSE. This indicates that a feature selection algorithm can improve the performance of a machine learning algorithm for poverty prediction. We also found that CFS feature selection produces an unstable feature ranking. The main weakness of the proposed method is that it still has difficulty predicting regions with higher poverty rates, while its main advantage is a smaller error than the other algorithms; the performance gap between the SVR model and the other machine learning models, e.g. k-NN and LR, is quite large. Overall, the results show the potential of combining e-commerce data, feature selection algorithms, and machine learning algorithms for poverty estimation, and governments, policymakers, and researchers can consider e-commerce datasets as a proxy for socio-economic conditions. The study is limited by using only one year of data; more data are needed to improve the model's performance, and additional data might produce a better model, especially for the underestimated results. For future research, larger datasets should be utilized to obtain a more accurate poverty model. However, the major limitation of e-commerce data remains its accessibility and confidentiality, which make it difficult to obtain.

Acronyms

Acronyms used in this paper can be seen in Table 8.

Table 8 Acronym list.