Poverty prediction using E-commerce dataset and filter-based feature selection approach

Poverty is a problem in many countries, notably Indonesia. The most common methods for obtaining poverty information are surveys and censuses; however, these take a long time and require substantial human resources. Governments and policymakers, on the other hand, need a faster way to assess socio-economic conditions when planning area development. Hence, in this paper, we use e-commerce data and machine learning algorithms as a proxy for poverty levels that can provide information faster than surveys or censuses. Because the e-commerce dataset is high-dimensional, feature selection algorithms are employed to determine the best features before building a machine learning model. Furthermore, three machine learning algorithms, support vector regression, linear regression, and k-nearest neighbor, are compared for predicting the poverty rate. The contribution of this paper is thus to propose a combination of statistical-based feature selection and machine learning algorithms to predict the poverty rate from e-commerce data. According to the experimental results, the combination of f-score feature selection and support vector regression surpasses the other methods. This shows that e-commerce data and machine learning algorithms can potentially be used as a proxy for predicting poverty.

Furthermore, one challenge of e-commerce data utilization is its high dimensionality, which must be reduced to improve the performance of machine learning algorithms. Our hypothesis is that a feature subset produced by a feature selection algorithm can improve the performance of a machine learning algorithm.
According to this background, we propose a solution that complements surveys and censuses by using e-commerce data and machine learning algorithms, especially in Indonesia. The proposed method can serve as a fast and low-cost way to predict the poverty level, which governments and policymakers can use as a baseline for development policies. The e-commerce dataset records the number of purchases from a particular area, so it can indicate whether that region is prosperous or not. Thus, the contribution of this research is to propose a combination of statistical-based feature selection and machine learning algorithms to build a model for predicting poverty in Indonesia, using a dataset that represents Indonesian people's needs based on e-commerce data. This study has the following motivations:
1. This study used e-commerce data from one of the largest e-commerce companies in Indonesia. Such data can be updated rapidly to complement the National Socio-economic Survey, which records poverty every 5 years. Near real-time poverty prediction can help governments and policymakers prioritize development plans.
2. Existing studies have predicted poverty using phone records 9, satellite imagery 10,11, and small area estimation 12. However, these studies relied on several assumptions. We instead use an e-commerce dataset obtained from one of the e-commerce platforms in Indonesia, arguing that e-commerce data can represent the economic conditions of a particular area.
Hence, to the best of our knowledge, the study of poverty estimation using e-commerce data and machine learning algorithms is relatively new and rare. In our previous works, we performed poverty prediction using machine learning algorithms; however, those studies used only one feature selection algorithm, which made them quite limited 8,13. A wrapper-based feature selection algorithm was also utilized, but it could not provide satisfactory results 14. Hence, to emphasize originality, in this paper we use three statistical-based feature selection algorithms and three machine learning algorithms to find the best model for poverty estimation. This approach can be used not only for Indonesian data but can also potentially be adopted for e-commerce data from other countries.

Methods
The dataset used in this research is sample advertising data from one of the e-commerce companies in Indonesia, regenerated and changed in value. It covers eight items in 2016: motorbikes, cars, apartments for sale, apartments for rent, houses for sale, houses for rent, land for sale, and land for rent. To measure the level of poverty, a poverty limit/line is needed. The poverty line reflects the rupiah value of the minimum monthly expenditure a person needs to fulfill basic food and non-food needs. These items are included in the list of basic-needs commodities 15. The advertisements used for this study come from Java Island, the island contributing the most postings, with a total of 18,881,913 advertisements across 118 cities/districts. Table 1 shows the details of the dataset used in this study.
From each item in Table 1, four aspects were aggregated per city, namely the number of items sold, selling price, number of buyers, and number of viewers, each with three statistical features (average, sum, and standard deviation). The resulting e-commerce dataset contains 96 numeric features and the actual poverty rate as a continuous label. The ground truth poverty rate is the official poverty rate issued by BPS (Statistics Indonesia) for the same year. Thus, we have 8 items × 4 aspects × 3 statistical features = 96 features, so our dataset contains 96 columns (features) × 118 rows (cities). The extraction of items and aspects from the dataset is shown in Fig. 1. Because the data dimension is relatively large, we used statistical-based feature selection algorithms to select the most relevant features and machine learning algorithms to train and build prediction models on the selected features. The prediction process has five stages: pre-processing, normalization, feature selection, model training, and evaluation.
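The per-city aggregation described above can be sketched with pandas. The column names and toy records below are hypothetical stand-ins for the actual advertisement data, which is not public:

```python
import pandas as pd

# Hypothetical ad-level records: city, item, and the four measured aspects.
ads = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "item": ["car", "car", "car", "motorbike", "motorbike"],
    "items_sold": [3, 5, 2, 7, 1],
    "price": [100.0, 150.0, 90.0, 20.0, 25.0],
    "buyers": [2, 4, 1, 6, 1],
    "viewers": [30, 50, 20, 70, 10],
})

# Aggregate each (item, aspect) pair per city with the three statistics,
# then pivot so each city becomes one row of features.
agg = ads.groupby(["city", "item"]).agg(["mean", "sum", "std"])
features = agg.unstack("item")                         # one row per city
features.columns = ["_".join(c) for c in features.columns]

# With 8 items x 4 aspects x 3 statistics this yields the 96 columns
# described in the text; the toy data above gives 2 items x 4 x 3 = 24.
```

Missing (item, city) combinations simply become NaN columns for that city, which is handled in the cleaning stage.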
Figure 2 shows the flow of the proposed method. It starts by extracting the e-commerce data into items and aspects and calculating the statistical aggregation values. The extracted data still contains dirty records, which need to be cleaned up. The clean data is then normalized so that all features share the same scale, [0, 1]. In this work, min-max normalization is used; it standardizes the dataset using a linear transformation 16, constraining a potentially large data range to a fixed interval. It transforms a value $X_0$ into $X_p$ within the specified range, as given by Eq. (1):

$$X_p = \frac{X_0 - X_{\min}}{X_{\max} - X_{\min}}, \quad (1)$$

where $X_{\min}$ and $X_{\max}$ are the minimum and maximum values of the feature.
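The min-max step can be implemented with NumPy as a short sketch; the guard against constant columns is our own addition, not part of the paper:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature column of X into [0, 1] via min-max normalization."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Guard: a constant column would divide by zero; map it to all zeros.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span

# Toy example: two features on very different scales.
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 300.0]])
X_norm = min_max_normalize(X)   # each column now spans exactly [0, 1]
```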

Feature selection algorithms
Many datasets, such as those from marketplaces, healthcare, and social media, have high dimensionality. High-dimensional data causes problems for algorithms designed for low-dimensional spaces and can also increase memory usage. To deal with high-dimensional data, this paper uses several filter-based feature selection algorithms: f-score, chi-square, and correlation-based feature selection (CFS). Filter methods do not rely on any learning algorithm; they rely on data characteristics to assess the importance of features and are usually more computationally efficient than other methods 20. The f-score is a simple technique for measuring the discrimination between two sets of real numbers 21. Given training vectors $x_k$, $k = 1, 2, \dots, m$, with $n_+$ positive and $n_-$ negative instances, the f-score of the $i$th feature is defined in Eq. (2):

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}, \quad (2)$$

where $\bar{x}_i$, $\bar{x}_i^{(+)}$, and $\bar{x}_i^{(-)}$ are the averages of the $i$th feature over the whole, positive, and negative datasets, respectively. The numerator indicates the discrimination between the positive and negative sets, and the denominator indicates the discrimination within each of the two sets. A greater f-score value indicates a more discriminative feature.
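A minimal NumPy implementation of the f-score in Eq. (2) might look as follows; binary labels are assumed and the toy data is illustrative only:

```python
import numpy as np

def f_score(X, y):
    """F-score of each feature of X for a binary label y in {0, 1} (Eq. 2)."""
    X, y = np.asarray(X, float), np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    # Numerator: between-class discrimination around the overall mean.
    num = (pos.mean(0) - X.mean(0)) ** 2 + (neg.mean(0) - X.mean(0)) ** 2
    # Denominator: within-class scatter (sample variances, ddof=1).
    den = pos.var(0, ddof=1) + neg.var(0, ddof=1)
    return num / den

# Feature 0 separates the classes cleanly; feature 1 is noisy overlap.
X = np.array([[1.0, 5.0], [1.2, 3.0], [5.0, 4.0], [5.2, 4.5]])
y = np.array([0, 0, 1, 1])
scores = f_score(X, y)   # scores[0] is far larger than scores[1]
```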
Chi-square uses the test of independence to assess whether a feature is independent of the class label 22. The chi-square criterion is defined in Eq. (3):

$$\chi^2 = \sum_{j=1}^{2}\sum_{s=1}^{c}\frac{\left(n_{js} - \mu_{js}\right)^2}{\mu_{js}}, \quad (3)$$

where $c$ is the number of classes, $n_{js}$ the number of patterns in the $j$th interval and $s$th class, $R_j = \sum_{s=1}^{c} n_{js}$ the number of patterns in the $j$th interval, $K_s = \sum_{j=1}^{2} n_{js}$ the number of patterns in the $s$th class, $N = \sum_{j=1}^{2} R_j$ the total number of patterns, and $\mu_{js} = R_j K_s / N$ the expected frequency of $n_{js}$. If $R_j$ or $K_s$ is 0, $\mu_{js}$ is set to 0.1. A higher chi-square value indicates that the feature is relatively important. However, the chi-square algorithm needs discrete values to perform feature selection; hence, in this experiment, the continuous poverty rate is rounded to obtain discrete class values. Moreover, the basic idea of the CFS algorithm is to use a correlation-based heuristic to evaluate the worth of a feature subset 23. The CFS criterion is defined in Eq. (4):

$$Merit_s = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k-1)\,\bar{r}_{ff}}}, \quad (4)$$

where the score shows the heuristic "merit" of a feature subset $s$ containing $k$ features, $\bar{r}_{cf}$ is the mean feature-class correlation for $f \in s$, and $\bar{r}_{ff}$ is the average feature-feature intercorrelation. The stronger the correlation with the class label and the weaker the intercorrelation among features, the better the feature subset.
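The chi-square criterion of Eq. (3) can be sketched for a single feature as follows. The median-based split into two intervals is our assumption, since the paper does not state how features are discretized:

```python
import numpy as np

def chi_square_score(x, y_class, threshold=None):
    """Chi-square score of one feature x against discrete classes (Eq. 3).

    The feature is binned into two intervals around `threshold`
    (defaulting to the median), matching the two-interval sum in Eq. (3).
    """
    x, y_class = np.asarray(x, float), np.asarray(y_class)
    if threshold is None:
        threshold = np.median(x)
    intervals = (x > threshold).astype(int)        # interval index j in {0, 1}
    N = len(x)
    score = 0.0
    for j in (0, 1):
        R_j = np.sum(intervals == j)               # patterns in interval j
        for s in np.unique(y_class):
            K_s = np.sum(y_class == s)             # patterns in class s
            n_js = np.sum((intervals == j) & (y_class == s))
            mu_js = R_j * K_s / N                  # expected frequency
            if mu_js == 0:
                mu_js = 0.1                        # guard, as in the paper
            score += (n_js - mu_js) ** 2 / mu_js
    return score

# The continuous poverty rate would first be rounded to obtain classes,
# e.g. y_class = np.round(poverty_rate).
dependent = chi_square_score([1, 2, 3, 4], [0, 0, 1, 1])    # feature tracks class
independent = chi_square_score([1, 2, 3, 4], [0, 1, 0, 1])  # no association
```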

Machine learning algorithms
Mainly, in this research, we used the support vector regression (SVR) algorithm, because SVM classification has already shown good results in medical diagnostics, optical character recognition, electric load forecasting, and other fields 24, and SVR is the most common form of SVM for building regression models 25,26. The advantages of SVM are a unique solution, insensitivity to small changes of parameters, and increased performance 27. In addition, SVM implements structural risk minimization to obtain good generalization on a limited number of learning patterns 28. The data from the previous stages is used for training in this stage. Consider a training dataset $\{(\vec{x}_1, z_1), \dots, (\vec{x}_l, z_l)\}$, where $\vec{x}_i$ and $z_i$ are the feature vector and target output, respectively. The standard SVR criteria are given in Eqs. (5) through (9) 29:

$$\min_{w,\,b,\,\xi,\,\xi^*} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\left(\xi_i + \xi_i^*\right), \quad (5)$$

subject to

$$w^T\phi(\vec{x}_i) + b - z_i \le \varepsilon + \xi_i, \quad (6)$$

$$z_i - w^T\phi(\vec{x}_i) - b \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0, \ i = 1, \dots, l, \quad (7)$$

where $w$, $C$, $\xi$, $\varepsilon$, and $b$ are the slope matrix, regularization parameter, slack variables for the soft margin, tolerance margin, and intercept/bias, respectively. The dual problem is

$$\min_{\alpha,\,\alpha^*} \ \frac{1}{2}\left(\alpha - \alpha^*\right)^T Q \left(\alpha - \alpha^*\right) + \varepsilon\sum_{i=1}^{l}\left(\alpha_i + \alpha_i^*\right) + \sum_{i=1}^{l} z_i\left(\alpha_i - \alpha_i^*\right), \quad (8)$$

subject to $e^T(\alpha - \alpha^*) = 0$ and $0 \le \alpha_i, \alpha_i^* \le C$, where $\alpha, \alpha^*$ denote the Lagrangian multipliers and $Q_{ij} = K(\vec{x}_i, \vec{x}_j)$. The approximate function after solving the problem in Eq. (8) is

$$f(\vec{x}) = \sum_{i=1}^{l}\left(\alpha_i^* - \alpha_i\right) K(\vec{x}_i, \vec{x}) + b, \quad (9)$$

so the output from the model is determined by $\alpha^* - \alpha$.
To ensure the model has good parameter values, a grid search was performed in this experiment over the kernel (RBF or polynomial), the epsilon value within [0.1, 0.5, 1.0, 1.5, 2.0], the parameter C within [1, 10, 100, 1000], and gamma within [0.001, 0.0001]. LIBSVM was used as the library for SVR. From the grid search, we selected the RBF kernel with epsilon = 0.5, C = 10, and gamma = 0.001. We also used k-nearest neighbor regression (k-NN) and linear regression (LR) for comparison with SVR. The k-NN algorithm classifies objects based on the closest training examples in feature space 30; it is a type of lazy learning where the function is only approximated locally. The same method can be used for regression by assigning the object the average value of its k nearest neighbors. K-NN is widely adopted for classification and regression because of its simplicity and intuitiveness 31, while LR studies the linear relationship between a dependent variable and one or more independent variables 32.
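The grid search described above can be reproduced with scikit-learn's SVR and GridSearchCV as a sketch; the paper used LIBSVM directly, and the synthetic data here merely stands in for the normalized city features:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for the normalized 96-feature city data.
rng = np.random.default_rng(0)
X = rng.random((40, 5))
y = X @ rng.random(5) + 0.05 * rng.standard_normal(40)

# Exactly the grids stated in the text (kernel, epsilon, C, gamma).
param_grid = {
    "kernel": ["rbf", "poly"],
    "epsilon": [0.1, 0.5, 1.0, 1.5, 2.0],
    "C": [1, 10, 100, 1000],
    "gamma": [0.001, 0.0001],
}
search = GridSearchCV(SVR(), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
best = search.best_params_   # the paper's search selected rbf, 0.5, 10, 0.001
```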

Evaluation
For cross-validation, the leave-one-out method was used. In this experiment, we used the root mean squared error (RMSE) and R-squared ($R^2$) to measure the performance of the machine learning model. The RMSE measures the error between the actual and predicted vectors; a lower RMSE means a smaller difference between actual and predicted data and hence a better prediction. Equation (10) defines the RMSE:

$$RMSE = \sqrt{\frac{1}{L}\sum_{i=1}^{L}\left(y_i - \hat{y}_i\right)^2}, \quad (10)$$

where $y$, $\hat{y}$, and $L$ are the actual value, predicted value, and data length, respectively. In addition, we used $R^2$ as shown in Eq. (11) to show the portion of the variance of the actual data explained by the model:

$$R^2 = 1 - \frac{\sum_{i=1}^{L}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{L}\left(y_i - \bar{y}\right)^2}. \quad (11)$$

$R^2$ assesses whether the regression model can correctly predict the actual values. It typically ranges from 0 to 1; a value near 1 means the model predicts the actual data almost perfectly, while a value of 0 or below means the model does not follow the trend of the actual data 33.
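Leave-one-out cross-validation combined with the RMSE and $R^2$ of Eqs. (10) and (11) can be sketched with NumPy; the least-squares model below is only a placeholder for the actual regressors:

```python
import numpy as np

def loocv_metrics(X, y, fit, predict):
    """Leave-one-out CV: collect one prediction per held-out sample,
    then report RMSE (Eq. 10) and R^2 (Eq. 11) over all of them."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    preds = np.empty_like(y)
    for i in range(len(y)):
        mask = np.arange(len(y)) != i        # train on all but sample i
        model = fit(X[mask], y[mask])
        preds[i] = predict(model, X[i:i + 1])[0]
    rmse = np.sqrt(np.mean((y - preds) ** 2))
    r2 = 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)
    return rmse, r2

# Placeholder model: ordinary least squares with an intercept column.
fit = lambda X, y: np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)[0]
predict = lambda w, X: np.c_[np.ones(len(X)), X] @ w

rng = np.random.default_rng(1)
X = rng.random((30, 3))
y = X @ [1.0, 2.0, 3.0] + 0.01 * rng.standard_normal(30)
rmse, r2 = loocv_metrics(X, y, fit, predict)   # near-linear data: r2 close to 1
```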

Results and discussion
F-score, chi-square, and CFS feature selection were used to rank all features in the dataset and select the most relevant ones; the result of this stage is a ranking of feature indexes. Tables 2 and 3 show the feature selection results for f-score and chi-square, respectively. We also found that CFS produced an inconsistent index ranking in every experiment; these results are shown in Table 4.
According to the experiments, the first six feature indexes are consistent in ranking, while the others are not. Thus, we decided that the results of the CFS algorithm cannot be used for building a machine learning model, and we used only the f-score and chi-square feature selections. After the features were ranked, we searched for the best result by running prediction experiments with increasing numbers of features: 10, 20, 30, 40, and so on up to 96. Every experiment was evaluated with $R^2$ and RMSE. We also used LR and k-NN besides SVR to compare and show that SVR is the best method. Tables 5 and 6 show the SVR prediction results without and with feature selection, respectively.
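The sweep over increasing feature counts can be sketched as follows; the ranking, evaluation function, and synthetic data below are illustrative stand-ins for the paper's ranked features and cross-validated scores:

```python
import numpy as np

def sweep_feature_counts(X, y, ranking, evaluate, counts=(10, 20, 30, 40)):
    """Evaluate a model on the top-k ranked features for increasing k,
    returning the k with the best (highest) score and all scores."""
    results = {}
    for k in counts:
        idx = ranking[:k]                    # top-k feature indexes
        results[k] = evaluate(X[:, idx], y)
    best_k = max(results, key=results.get)
    return best_k, results

def evaluate(Xs, y):
    # In-sample R^2 of a plain least-squares fit (placeholder metric;
    # the paper evaluates with leave-one-out RMSE / R^2 instead).
    w = np.linalg.lstsq(Xs, y, rcond=None)[0]
    return 1.0 - np.mean((y - Xs @ w) ** 2) / np.var(y)

rng = np.random.default_rng(2)
X = rng.random((50, 96))                     # 96 features, as in the dataset
y = X[:, 0] * 2.0 + 0.1 * rng.standard_normal(50)
# Rank features by |correlation| with the target, strongest first.
ranking = np.argsort(-np.abs(np.corrcoef(X.T, y)[-1, :-1]))
best_k, results = sweep_feature_counts(X, y, ranking, evaluate)
```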
Tables 5 and 6 show the results of our experiments, organized by the number of features, the machine learning algorithm, and the feature selection algorithm; bold text indicates the best results. Table 5 compares the machine learning algorithms without feature selection; the best $R^2$ score, 0.42321, was obtained by SVR. Table 6 shows the SVR experiments with feature selection; the best $R^2$ value, 0.42765, was achieved using f-score feature selection with 90 features, as shown in Table 2. According to these experimental results, a feature selection algorithm can improve the performance of machine learning on the high-dimensional e-commerce dataset.
The best models of SVR, the k-NN regressor, and LR are visualized in Fig. 3a-c, respectively. The blue line is the trend of the actual data (y = x) and the yellow lines are the error thresholds (± 1.5). Data points between the error margins have less prediction error, and vice versa.
The visualization of the SVR results shows more data points with low error than the other models, making SVR the best of the models compared in this paper. The worst visualization is produced by linear regression, where some data points lie far from the actual data, meaning that LR produces the worst model.
Furthermore, the results are also visualized in the choropleth maps in Figs. 4 and 5. Figure 4 shows the actual poverty rate mapping of Java Island. The maps were generated using Leaflet 1.6.0 34, an open-source JavaScript library for building web mapping applications. A darker color indicates a higher poverty rate, and vice versa. The predicted poverty mapping is displayed in Fig. 5. Compared with the actual data in Fig. 4, the predicted data in Fig. 5 shows a lower poverty rate, indicating that the prediction model produces underestimated results. Finally, Table 7 presents a detailed comparison between the actual and predicted poverty rates at the city level, providing a breakdown of each poverty percentage value for a comprehensive analysis.

Conclusion
E-commerce data has the potential to predict poverty. Hence, we used machine learning algorithms to model the e-commerce data. Three feature selection algorithms were used to select the best features, and support vector regression was then used to predict the poverty rate. The experimental results show that using all features cannot guarantee good performance. Evaluated by RMSE and $R^2$, the f-score shows the best result among the three statistical-based feature selection algorithms, producing the highest $R^2$ and the lowest RMSE. This indicates that a feature selection algorithm can improve the performance of a machine learning algorithm for poverty prediction. Besides, we found that CFS feature selection produces an unstable feature ranking. The weakness of the proposed method is that it still has difficulty predicting regions with a higher poverty rate; its advantage is error minimization compared with the other algorithms, and the performance gap between the SVR model and the other machine learning models, e.g., k-NN and LR, is quite large. Overall, the results show the potential of e-commerce data, feature selection algorithms, and machine learning algorithms for poverty estimation. Governments, policymakers, and researchers can consider e-commerce datasets as a proxy for socio-economic conditions. This study is limited by using only one year of data; more data is needed to improve the model's performance, and additional data might produce a better model, especially for the underestimated results. For future research, larger data should be utilized for a more accurate poverty model. However, the major limitation of e-commerce data remains its accessibility and confidentiality, which make it difficult to obtain.

Acronyms
Acronyms used in this paper can be seen in Table 8.

Figure 1. Illustration of items and aspects extraction from the dataset.

Figure 2. Flow of the proposed method.

Figure 4. Actual poverty rate mapping for each city on Java Island.

Figure 5. Predicted poverty mapping based on cities on Java Island.

Table 5. Prediction experiments without feature selection.