A machine learning model with human cognitive biases capable of learning from small and biased datasets

Human learners can generalize a new concept from a small number of samples. In contrast, conventional machine learning methods require large amounts of data to address the same types of problems. Humans have cognitive biases that promote fast learning. Here, we developed a method to reduce the gap between human beings and machines in this type of inference by utilizing cognitive biases. We implemented a human cognitive model into machine learning algorithms and compared their performance with the currently most popular methods, naïve Bayes, support vector machine, neural networks, logistic regression and random forests. We focused on the task of spam classification, which has been studied for a long time in the field of machine learning and often requires a large amount of data to obtain high accuracy. Our models achieved superior performance with small and biased samples in comparison with other representative machine learning methods.

Machine learning has been widely studied and has contributed to technologies used in our everyday life, such as automatic translation, image recognition and spam classification 1 . One notable method of machine learning is supervised learning, which generalizes the concept of a problem from a set of labeled training data. For example, a spam classifier uses training data in the form of a large sample of email texts that have been previously labeled into two classes, spam and ham (i.e., non-spam), to classify new uncategorized emails. The early representative machine learning methods include perceptron 2 , logistic regression (LR) 3 and the nearest-neighbor rule 4 . Neural networks (NN) 5 and support vector machine (SVM) 6 were later proposed based on the perceptron. The other notable machine learning models include naïve Bayes (NB) and random forests (RF) 7 . These machine learning models have been studied for a long time and have shown superior performances across a variety of tasks.
Usually, these models require a large, well-balanced sample dataset to assure prediction accuracy 8 . However, in practice, the class proportions of real data are often biased. For example, over 90% of emails were identified as spam in 2012 9 , while common training datasets such as SpamAssassin 10 and Ling-Spam 11 consist of only 20-30% spam-labeled data, with the rest being ham-labeled. That is, real data are more likely to be imbalanced. In addition, spam emails have become difficult to detect because their form has changed dramatically 12 . The class proportions and properties of real data may therefore differ considerably from those of the common datasets used as testbeds in machine learning studies, and we consider that there is a strong need for machine learning models that can deal with such situations.
In contrast, humans can generalize a new concept from small and biased samples 13,14 . For example, by seeing a hippopotamus in a zoo for the first time, an infant can obtain a lot of information about the new object: what it looks like, how big it is, and the characteristics that differentiate hippos from other animals. In machine learning, hundreds or thousands of training samples may be required to tackle the same problems 15 . Also, humans do not need a large number of negative samples to learn a positive instance 15 . For example, infants do not need to see elephants in order to learn what hippos are. That is, humans can generalize a new concept from samples of a single class, while machine learning requires large amounts of data and many labels.
In spam classification, each message is characterized as an n-dimensional word feature vector $W = (w_1, w_2, \ldots, w_n)$ that is predefined to belong to a class $c_i \in C = \{spam, ham\}$, where ham is non-spam. The posterior probability that $W$ belongs to $c_i$ is given by Bayes' theorem, as defined in equation (1).

$$P(c_i \mid W) = \frac{P(c_i)\,P(W \mid c_i)}{P(W)} \qquad (1)$$

$P(W)$ is called the "evidence"; it takes the same value for all classes and does not affect the relative values of their probabilities 43 . It can therefore be ignored, as in equation (2).

$$P(c_i \mid W) \propto P(c_i)\,P(W \mid c_i) \qquad (2)$$
The NB classifier assumes that every feature in a text is conditionally independent and that each distribution is estimated as a one-dimensional distribution 1 . In practice, this assumption is unrealistic because words such as "money" and "casino" are likely to co-occur in the same text. However, this assumption of conditional independence greatly simplifies the calculation and yields superior performance in text classification.
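The conditional independence assumption can be illustrated with a minimal sketch. It is not the authors' R implementation: the function names `train_nb` and `classify_nb` are hypothetical, and the add-one smoothing is an assumption that the text does not mention. Equal priors are used by default, and the product of per-word likelihoods is computed as a sum of logarithms to avoid numerical underflow.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Estimate P(w | c) for every class c from tokenized documents.

    alpha is an add-alpha smoothing constant (an assumption of this sketch).
    """
    vocab = sorted({w for doc in docs for w in doc})
    counts = {c: Counter() for c in set(labels)}
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    likelihood = {}
    for c, cnt in counts.items():
        total = sum(cnt.values()) + alpha * len(vocab)
        likelihood[c] = {w: (cnt[w] + alpha) / total for w in vocab}
    return likelihood


def classify_nb(doc, likelihood, priors=None):
    """Return the class maximizing log P(c) + sum_j log P(w_j | c)."""
    classes = sorted(likelihood)
    if priors is None:
        # Equal priors for every class, e.g. P(spam) = P(ham) = 0.5.
        priors = {c: 1.0 / len(classes) for c in classes}
    scores = {}
    for c in classes:
        score = math.log(priors[c])
        for w in doc:
            if w in likelihood[c]:  # words outside the vocabulary are skipped
                score += math.log(likelihood[c][w])
        scores[c] = score
    return max(scores, key=scores.get)
```

Because the evidence term is identical for all classes, the sketch never computes it; only the relative scores matter for the decision.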
Loosely symmetric naïve Bayes. We implemented the LSNB model based on the naïve Bayes classifier. If both bias terms, $\frac{ac}{a+c}$ and $\frac{bd}{b+d}$, equal zero, the model is equivalent to conditional probability; namely, there is no bias. If b = c is satisfied, equations (3) and (5) are always equivalent and the model has a complete symmetric bias. Additionally, if a = d and b = c are simultaneously satisfied, equations (3), (5) and (6) are equivalent and the model has complete symmetric and mutually exclusive biases. Figure 1, which shows the relation between $LS(q \mid p)$ and $LS(p \mid q)$ as well as the relation between $LS(q \mid p)$ and $LS(\bar{q} \mid \bar{p})$, demonstrates this. The data points in the figure were randomly generated by sampling a, b, c and d uniformly from [0, 1]. If the bias is complete, then $LS(q \mid p) = LS(p \mid q)$ and $LS(q \mid p) = LS(\bar{q} \mid \bar{p})$ hold, and the graphs show a positive and proportional relationship. If there is no bias, there is no correlation between $LS(q \mid p)$ and $LS(p \mid q)$ or between $LS(q \mid p)$ and $LS(\bar{q} \mid \bar{p})$, so the distribution of the plots in Fig. 1 would be random. The distributions of the plots in Fig. 1 show an intermediate shape: a hybrid of proportional and random distributions. This may seem trivial; however, if the model always had a complete symmetric bias or a complete mutually exclusive bias, it would be illogical and would not resemble human inference. The ΔP and DH models always have either bias in complete form, and conditional probability does not involve any biases 22,29 . Meanwhile, the LS model flexibly adjusts the weights of each type of bias using the terms $\frac{ac}{a+c}$ and $\frac{bd}{b+d}$ 29 .
In order to apply the formula to an NB approach, the 2 × 2 contingency table is arranged as in Table 2 and LSNB is calculated as in equations (7)-(10). The entries of the table are the probabilities of co-occurrence in the presence or absence, respectively, of $w_j$ in class $c_i$. Each probability is given by equations (7)-(10), and the modified weight of word $w_j$ in class $c_i$ is calculated as in equations (11)-(13).
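The properties described above can be checked numerically with a small sketch. The formula below is an assumption: it is the commonly cited form of the loosely symmetric model, $LS(q \mid p) = \frac{a + \frac{bd}{b+d}}{a + b + \frac{ac}{a+c} + \frac{bd}{b+d}}$, with the cells taken as a = P(p, q), b = P(p, ¬q), c = P(¬p, q) and d = P(¬p, ¬q). Neither the cell labeling nor the function name `ls` is taken from this section, so the sketch should be read as illustrative only.

```python
def ls(a, b, c, d):
    """Assumed loosely symmetric value LS(q|p) for a 2 x 2 contingency table.

    Cells (assumed labeling): a = P(p, q), b = P(p, not q),
    c = P(not p, q), d = P(not p, not q). Requires a + b > 0.
    """
    def bias_term(x, y):
        # x*y / (x + y), defined as 0 when both cells are 0
        return x * y / (x + y) if (x + y) > 0 else 0.0

    bd = bias_term(b, d)
    ac = bias_term(a, c)
    return (a + bd) / (a + b + ac + bd)


def ls_converse(a, b, c, d):
    """LS(p|q): swapping the roles of p and q exchanges cells b and c."""
    return ls(a, c, b, d)


def ls_inverse(a, b, c, d):
    """LS(not q | not p): negating both p and q exchanges a with d and b with c."""
    return ls(d, c, b, a)
```

Under these assumptions, `ls(a, b, 0, 0)` reduces to the conditional probability `a / (a + b)`, `ls` and `ls_converse` coincide whenever b = c, and all three functions coincide when a = d and b = c, which matches the behaviour described in the text.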

For example, if words such as "money" and "casino" are observed much more frequently in spam-labeled texts than in ham-labeled texts, these words should be considered spam-related words, and vice versa; the settings above reflect such situations. The main difference between NB and LSNB is the way in which they calculate the posterior probability; in the NB calculation process, the likelihood  (14).
In the above formula, $N(c_i, w_j)$ is the number of times word $w_j$ occurs in class $c_i$ and $WD(c_i, w_j)$ is the word density. The word density is used as a confidence measure in many text classification applications 44 . We developed this model to better adjust the weights of each feature. For example, if word $w_j$ is frequently observed in spam texts but infrequently in ham texts, then word $w_j$ should be more strongly considered to be a spam-related word. We incorporated word density information into eLSNB because an external bias can be effective in some practical cases. The causal relationship is sometimes difficult to estimate from observed raw data. In such a condition, the eLSNB model introduces bias into the features and flexibly modifies their weights as shown in equations (15)-(18).
After the weight modifications, the eLSNB model calculates the likelihood and the posterior probability as in equations (11)-(13).
Email corpus. We used two publicly available English email corpora in the experiment. The SpamAssassin 10 corpus consists of 3900 ham messages and 1897 spam messages, so 33% of the sample data were spam. The Ling-Spam 11 corpus consists of 2412 ham messages and 481 spam messages. We used the lemm version (in which texts are lemmatized) of the Ling-Spam corpus for the experiment, and 17% of the messages were spam.

Experimental Settings. We conducted two experiments with different percentages of spam and ham messages in the learning phase, using seven classification models: SVM, NN, LR, RF, NB, LSNB and eLSNB. The SVM classifier was used with a Gaussian kernel, which is common for text classification, with cost = 0.1 and gamma = 0.1. We used a three-layered NN with a sigmoid function, which is common for binary classification. The number of nodes in the hidden layer was 10. This number may seem small, but we found it suitable for the following experiments. The LR used binomial regression with α = 0.1. RF used 300 trees, and the number of features for the decision split was the square root of the dimensions of the feature space. For the experimental settings of the NB, LSNB and eLSNB models, we set the prior probabilities to be equal for each class, namely, P(spam) = 0.5 and P(ham) = 0.5, to avoid any initial asymmetry. Half of the whole dataset was used as test data. The parameters of each model were determined after several trials, and the best values were chosen for the experiments.
In the following experiments, we used only biased and skewed training data. The settings used for Exp. 2-2 were the inverse of those used in Exp. 2-1. In particular, Exp. 1 investigated how biased data affect machine learning models, and Exp. 2 investigated how small data affect their performance.
Before the experiments were conducted, we eliminated punctuation, numerals and stop words 45 from email texts as well as any word features that were observed only once. Stop words are English words commonly observed in any general text, such as "this" and "you". These words do not affect classification and are often eliminated before the training phase. Furthermore, according to the theory of burstiness 46,47 , words related to the text content tend to be observed more than once. Thus, we eliminated words from the feature vector that were observed less than twice. All experiments were implemented using R (https://www.r-project.org). We used the e1071 package for SVM, the nnet package for NN, the glmnet package for LR, and the randomForest package for RF. NB, LSNB and eLSNB models were implemented within the R statistical computing environment using custom scripts.
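The preprocessing steps described above can be sketched as follows. This is an illustration rather than the authors' R scripts: the stop-word set is a tiny hypothetical stand-in for the cited list, the names `tokenize` and `build_vocabulary` are made up for this example, and counting occurrences over the whole training corpus is an assumption.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; the study used a published list instead.
STOP_WORDS = {"this", "you", "the", "a", "and", "to", "of"}


def tokenize(text):
    """Lower-case the text and keep alphabetic runs only.

    Matching [a-z]+ discards punctuation and numerals in one step;
    stop words are then filtered out.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocabulary(texts, min_count=2):
    """Keep only word features observed at least min_count times overall."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {word for word, n in counts.items() if n >= min_count}
```

With `min_count=2`, any word seen only once across the given texts is dropped from the feature vector, in line with the burstiness argument above.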

Results
We compared the performance of NB, LSNB, eLSNB, SVM, NN, LR and RF on spam classification. The purpose of the task was to classify texts into one of two classes, spam and ham. We used two mail corpora, SpamAssassin and Ling-Spam, in the following experiments.

Experiment 1. In the following experiments, we varied the percentage of spam training data in each experiment and compared the spam classification accuracy, ham classification accuracy and F-measure.

Experiment 1-1. The results of Exp. 1-1 are shown in Fig. 2. Overall, the eLSNB, LSNB, NN and RF methods achieved higher spam classification accuracy. NB did not improve its spam classification accuracy through the experiment and showed relatively lower accuracy. In ham classification, eLSNB, LSNB, NB and NN showed higher classification accuracy compared with the other models. LR did not improve its ham classification accuracy through the experiment and was the worst among all the models.
In this experiment, the total number of training data was less than 500. Therefore, each classification model was expected to have difficulty in optimizing the proper weights for each feature. However, eLSNB, LSNB, NN and RF yielded higher F-measure scores. Additionally, the eLSNB and LSNB models exhibited improved spam and ham accuracies compared with the NB base model. Overall, eLSNB performed the best in terms of F-measure, supported by its use of the LS model and word density information for optimizing the feature weights from a small number of samples.

Experiment 1-2. The results of Exp. 1-2 are shown in Fig. 3. Although the percentage of spam training data in this experiment differed from that in Exp. 1-1, almost all the models yielded similar results. For example, the NB, LSNB, eLSNB and NN models performed similarly throughout Exp. 1. Meanwhile, RF, LR and SVM showed some trade-offs between spam and ham accuracies. These models improved ham classification accuracy and decreased spam classification accuracy compared with the results in Exp. 1-1. Therefore, the performances of RF, LR and SVM were affected by the spam percentage, and these models are expected to have some sensitivity to imbalanced data.

Additionally, the trade-offs of RF, LR and SVM became wider in Exp. 1-3 relative to Exp. 1-2. Furthermore, NN showed a trade-off that was not observed in Exp. 1-1 or 1-2. Meanwhile, eLSNB, LSNB and NB did not show such a trade-off and produced higher spam classification accuracy. NB, LSNB and eLSNB did not appear to be affected by changes in the class distributions. The proposed models each outperformed the NB base model. Thus, the eLSNB and LSNB approaches had some advantage under the biased class distribution in comparison with other models, somewhat resembling the fast learning that is characteristic of humans. Overall, Exp. 1 showed that the LSNB and eLSNB methods produced higher classification accuracy than NB and had the highest performance in terms of F-measure.
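The three measures compared in these experiments can be computed as in the sketch below. It assumes that spam is the positive class, that spam and ham classification accuracy are the per-class fractions of correctly classified test messages, and that the F-measure is the harmonic mean of precision and recall (F1); this section does not spell out these definitions, and the function name `classification_scores` is hypothetical.

```python
def classification_scores(y_true, y_pred, positive="spam"):
    """Per-class accuracies and F-measure for a two-class task.

    Assumes `positive` marks spam; every other label counts as ham.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def safe_div(num, den):
        return num / den if den else 0.0

    spam_accuracy = safe_div(tp, tp + fn)   # recall on the spam class
    ham_accuracy = safe_div(tn, tn + fp)    # recall on the ham class
    precision = safe_div(tp, tp + fp)
    f_measure = safe_div(2 * precision * spam_accuracy,
                         precision + spam_accuracy)
    return {
        "spam_accuracy": spam_accuracy,
        "ham_accuracy": ham_accuracy,
        "f_measure": f_measure,
    }
```

Reporting the two per-class accuracies separately is what exposes the trade-offs discussed here: a model that labels almost everything as ham scores near-perfect ham accuracy while its spam accuracy collapses.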

Experiment 2.
In the following experiments, the number of training data of either class was predefined to be a constant value. When the training data contained 100 spam messages, all models showed increased ham classification accuracy. Meanwhile, NN, LR, RF and SVM showed decreased spam classification accuracy as the training data increased in number. Thus, these models showed some sensitivity to the data distribution, owing to the lack of spam relative to ham in the training data. If the number of spam messages in the training data is large enough relative to ham, each machine learning model is able to estimate the proper weights for each feature. However, in this experiment, the feature distributions between spam and ham were strongly biased. NN, LR, RF and SVM could not properly weight each feature, and their spam classification accuracy decreased as the number of ham training data increased. Meanwhile, LSNB and eLSNB did not decrease in either spam or ham classification accuracy. Overall, eLSNB produced superior results in terms of F-measure.
When the number of spam training data was predefined as 25, RF, SVM and NN performed better in ham classification. However, as in Exp. 2-1, these models did not show superior results on spam classification. In particular, SVM showed higher ham classification accuracy, though the spam classification accuracy was the worst among all the models. The NN, RF and LR also showed similar trade-offs. LSNB was also affected by the biased sample data, which was not observed in Exp. 1. Although LSNB produced higher ham scores, the spam classification accuracy decreased as the sample dataset increased in size. Meanwhile, NB and eLSNB did not show  Fig. 7 (where the number of ham training data was predefined to 100) and Fig. 8 (where the number of ham training data was predefined to 25). When there were 100 ham training data, almost all the models increased spam classification accuracy as the size of the training data increased. Meanwhile NN, SVM, LR and RF models decreased in ham classification accuracy through the experiment. This suggests that these models also showed some sensitivity, as seen in the results of Exp. 2-1.
When there were 25 ham training data points, eLSNB and NB had superior performance in ham classification. Meanwhile, NN, LR, RF and SVM did not show higher performance, even though these models had higher ham classification accuracy in Exps 1 and 2-1. Also, NN, LR, SVM and RF had the best performance in spam classification. The data proportions of Exps 2-1 and 2-2 were symmetric, and therefore the spam classification results of Exp. 2-1 and the ham classification results of Exp. 2-2 were similar. The results might not show such symmetry if the data properties or feature distributions were very different between the spam and ham training data. Since most models showed trade-offs as the dataset increased in size, these models had some sensitivity to imbalances in the data distributions. As Exps 1 and 2 showed, SVM, LR, RF and NN were strongly affected by the data ratio. NB did not show such a strong trade-off, but its performance was relatively lower. The proposed LSNB model showed a trade-off in Exp. 2, and the bias adjustment of the model failed to some degree in some cases. Meanwhile, eLSNB overcame this weakness, and word density information helped to prevent the problematic data sensitivity. Overall, eLSNB had the highest F-measure values.

Discussion
The present study tested the performance of NB, SVM, NN, LR and RF machine learning methods against our models, designated LSNB and eLSNB, using small and biased samples. We focused on the classic spam classification task, which has been studied for a long time in the field of machine learning. The class proportions and contexts of real spam mail data differ considerably from those of common spam classification datasets, and machine learning models that can deal with such situations are strongly needed. The conventional algorithms, such as NB, NN, SVM, LR and RF, often require a large amount of well-balanced sample data to assure prediction accuracy in tasks such as spam classification. In contrast, humans can generalize a new concept from a small number of samples 13-15 . Some researchers claim that human beings have cognitive biases 16-20 and that these biases facilitate concept learning from small and biased samples 21,22 . We developed LSNB and eLSNB based on this hypothesis and attempted to reproduce this small and biased sample scenario properly as a machine learning task. The difference between NB and our models is that LSNB and eLSNB include two additional terms, $\frac{ac}{a+c}$ and $\frac{bd}{b+d}$, which modify the probabilities of the models. As shown in the Methods section, these two terms adjust the effectiveness of symmetric bias and mutually exclusive bias; in other words, they promote concept learning, but do not always make correct inferences.
In the experiments, we tested the models using different percentages of spam and ham data in the learning phase to investigate how model behaviors changed according to changes in the feature distribution. In Exp. 1-1, we used the same numbers of spam and ham training data points, and Exps 1-2 and 1-3 used less spam and more ham data (33% spam in Exp. 1-2 and 17% in Exp. 1-3). These three experiments were investigations of how biased data would affect the performances of machine learning models.
In Exp. 1-1, every model showed higher classification accuracy on spam and ham, and most models improved their performance as the data size increased. However, SVM, LR and NB showed relatively lower spam and ham classification accuracies. The total number of training data in this experiment was less than 500, and these models did not perform well with such a small number of training data. In contrast, LSNB and eLSNB improved upon the performance of the NB base model, producing superior results. The class distribution in this experiment was equal between spam and ham. Therefore, no trade-off was observed in any model.
In Exp. 1-2, SVM, LR and RF showed trade-offs between spam and ham classification. The spam classification accuracies of these models were relatively low at the initial stage and gradually increased throughout the experiment. Meanwhile, their ham classification performance increased only slightly, and their F-measure scores were similar to those in Exp. 1-1. Although RF showed higher F-measure scores, its sensitivity to class distributions was observed. In practice, NB, SVM, LR and RF often require a large amount of training data to assure prediction accuracy. However, we used only a limited number of training data in this study. These models thus showed less ability to learn from a limited number of data. Meanwhile, eLSNB, LSNB and NN did not show trade-offs and maintained higher classification performance.
In Exp. 1-3, SVM and RF showed larger trade-offs. These models showed almost perfect ham classification performance, even when the number of training data was small. However, their spam accuracies were very low. Although their spam accuracies increased throughout the experiment, these models did not perform as well as the other models. Furthermore, although NN did not show trade-offs in Exps 1-1 and 1-2, in Exp. 1-3 its spam accuracy decreased and its ham classification performance increased only slightly. This suggests that NN also suffered from the small, biased training data. eLSNB, LSNB and NB did not show such trade-offs and showed higher spam and ham classification performance. The spam accuracy of NB was relatively lower. Meanwhile, LSNB and eLSNB improved their performance compared with their base model, NB.
In Exp. 1, each model showed a characteristic tendency with respect to the data ratio. NN, SVM, LR and RF showed trade-offs between spam and ham classification accuracies. In particular, the trade-offs of these models became progressively larger as the spam percentage decreased. Therefore, these models exhibited some sensitivity to the feature distribution, and their accuracies fluctuated widely. NB did not show such a trade-off; however, its classification performance was relatively low. In contrast, LSNB and eLSNB improved upon the performance of the NB base model, producing superior results. LSNB and eLSNB adjust feature weights using feature vectors for each class, while NB simply calculates the product of the conditional probabilities. We consider that this modification yielded better learning from small and biased samples, and eLSNB produced the best performance in terms of its F-measure. The eLSNB model is a modified version of the LSNB model that uses word density information. This modification successfully improved the learning process.
In Exp. 2, we investigated the effect of more imbalanced sample distributions on machine learning models. We predefined the number of training data of either class at a constant value, i.e., 100 or 25. Therefore, the disparity in the number of training data between spam class and ham class messages became progressively wider throughout the experiment. Accordingly, the feature distribution of the training data was strongly imbalanced. In this experiment we focused on how small data affect the performances of machine learning models.
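The sampling scheme of Exp. 2 can be sketched as follows: one class is held at a constant size while the other grows, so that the imbalance widens at every step. The function name `sample_training_set`, the fixed random seed and the step sizes in the usage note are hypothetical; the increments actually used are not given in this section.

```python
import random


def sample_training_set(spam, ham, n_spam, n_ham, seed=0):
    """Draw n_spam spam and n_ham ham messages without replacement.

    Returns a shuffled list of (message, label) pairs. The seed only
    makes this illustration reproducible.
    """
    rng = random.Random(seed)
    chosen = [(msg, "spam") for msg in rng.sample(spam, n_spam)]
    chosen += [(msg, "ham") for msg in rng.sample(ham, n_ham)]
    rng.shuffle(chosen)
    return chosen


def fixed_spam_schedule(spam, ham, n_spam, ham_sizes):
    """Hold the spam count constant and grow the ham count step by step."""
    return [sample_training_set(spam, ham, n_spam, n_ham) for n_ham in ham_sizes]
```

For instance, `fixed_spam_schedule(spam, ham, 25, [25, 50, 100])` would produce three training sets whose spam share falls from 50% to 20%; swapping the roles of the two classes gives the inverse setting.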
In Exp. 2-1, all models except NB and eLSNB had strong trade-offs throughout the experiments and decreased in accuracy as the size of the training data set increased. For example, the spam classification performances of SVM, LR, RF and NN decreased significantly as the size of the ham training data increased. At the initial stage of the experiment, these models had lower ham performances and higher spam performances. If the models were able to optimize their performance under an imbalanced data distribution, such a decrease in accuracy would not be observed. As the data proportions of Exps 2-1 and 2-2 were symmetric, the spam classification results in Exp. 2-1 and the ham classification results in Exp. 2-2 were similar. This suggests that the composition of the feature distributions was symmetric between Exps 2-1 and 2-2. For example, if the spam data were easier to classify than the ham data, the results would be asymmetrical, and vice versa. Therefore, there is no initial asymmetry between the spam and ham training data. We consider that these trade-offs were not caused by the contents of the corpus, but rather by the difference in the number of training data points belonging to each class, that is, the imbalanced data distribution. SVM, LR, RF and NN were strongly affected by this factor. Also, our LSNB model showed a trade-off even though its NB base model did not decrease in performance through the experiment. We cannot explain the exact reason why LSNB showed such a trade-off, but we assume that LSNB may not fully adjust the effects of symmetric and mutually exclusive biases. Although NB did not exhibit a strong trade-off, its performance was relatively low. Additionally, there appears to be a difference in characteristics between NB and NN: NB did not show a trade-off but its classification performance was relatively low, while NN showed higher performance in terms of F-measure but had a strong trade-off.
In contrast, eLSNB did not show such a trade-off and consistently produced the best F-measure. The inclusion of word density information in the eLSNB model appeared to overcome the data sensitivity of the base LSNB model. In practice, in eLSNB, word density strengthened the contraposition of feature values in the 2 × 2 contingency table. As previous studies have indicated, human cognitive biases play a key role in the ability to learn from small and biased samples. However, we assume that human cognitive biases themselves are not powerful enough to produce human-level concept learning, and additional biases, such as word density, may be needed. Since the relationship between cause and effect is sometimes difficult to infer from observed raw data, external biases may promote concept learning in models, even if they are not derived directly from human cognition.
In conclusion, we developed LSNB and eLSNB models that include symmetric bias and mutually exclusive bias by implementing the LS model into a base NB model. These novel models were successful, yielding higher performance compared with existing representative machine learning algorithms with small and biased samples. Our models seem to have reproduced the ability of human learning to some extent. In future research, we will investigate the relationship between conditional probability, human cognitive bias, the effectiveness of external bias and how these factors interact in the learning process in order to realize human-level concept learning.