Towards improving e-commerce customer review analysis for sentiment detection

According to a report published by Business Wire, the market value of e-commerce reached US$13 trillion and is expected to reach US$55.6 trillion by 2027. In this rapidly growing market, product and service reviews can influence our purchasing decisions. Manually evaluating reviews to make decisions and examine business models is challenging; however, this process can be examined and automated with Natural Language Processing (NLP). NLP is a well-known technique for evaluating and extracting information from written or spoken text. NLP research also investigates the social architecture of societies. This article analyses an Amazon dataset using various combinations of parts of speech and deep learning. The suggested module focuses on classifying sentences as 'Positive', 'Neutral', 'Negative', or 'Indifferent'. It analyses the data and labels the 'better' and 'worse' assumptions as positive and negative, respectively. With the expansion of the internet and e-commerce websites over the past decade, consumers now have a vast selection of products within the same domain, and NLP plays a vital part in classifying products based on evaluations. It is also possible to distinguish sponsored from unpaid reviews using NLP with Machine Learning. This article examines various Machine Learning algorithms for predicting the sentiment of e-commerce website reviews. The automation achieves a maximum validation accuracy of 79.83% when using FastText as the word embedding and a Multi-channel Convolutional Neural Network.

In order to achieve the common aim of automation within the research community, an adequate understanding of the scientific literature is essential. It has been estimated that the total volume of research generated each year is increasing by 8-9%. This overabundance of knowledge leads to the 'reinventing the wheel' syndrome, which hampers the literature review process. Scientific progress is thus slowed at the frontier of knowledge, where NLP can solve many problems. Analysis of customer feedback can be challenging due to the high level of qualitative nuance contained within the material and the vast volume of data obtained by businesses. Because qualitative comments, reviews, and free text are more difficult to quantify than quantitative feedback 1, evaluating them may be more difficult. Natural Language Processing and Machine Learning will one day be able to process large amounts of text without the need for human intervention.
Text Clustering and Topic Modelling are the two methods most frequently used to identify the topics contained within a text corpus 2. Text pre-processing is essential to natural language processing because it converts raw text into a form that is easier to work with across different AI techniques, allowing machine learning algorithms to function more effectively.
As previously stated, understanding and analysing reviews is critical for making purchasing decisions. Both negative and positive evaluations are equally important. A research report 3 indicated that 82% of customers who purchase items intentionally seek out negative reviews. With a US$13 trillion online marketplace and the peer effect, reviews play a significant role in deciding what to buy and what not to buy. With the help of NLP, users can automate the process of analysing reviews. This paper examines various Machine Learning algorithms for predicting the sentiment of e-commerce website reviews. The main contributions of this work are:
• Collection of a publicly available raw review dataset containing Amazon product reviews as well as metadata.
• Data pre-processing and review analysis to provide insights into the various word vector representations.
• Evaluation of various Machine Learning and Deep Learning models with different word embedding approaches, such as BERT, GloVe, ELMo, and FastText, to predict the sentiment of e-commerce website reviews.
The remainder of the paper is structured as follows. Section "Related work" reviews the background, section "Methodology" describes the proposed methodology, and section "Experimental analysis and results" discusses the results, followed by the conclusion and future work.

Baselines.
We have studied machine learning models using various word embedding approaches and combined our findings with natural language processing. During the analysis phase, the priority is predominantly on detailing the operations performed on the dataset with BERT, GloVe, ELMo, and FastText. An investigation was performed on a wide range of combinations of NLP and deep learning strategies, including methodologies considered state-of-the-art. Building the best possible combination requires integrating several different strategies. Not all models can be integrated with deep learning techniques in their initial form, because the procedures need to be revised; the techniques mentioned must be redesigned to achieve better results.

Related work
The qualitative nature of the data and the enormous feedback volume are two obstacles in conducting customer feedback analysis. Analysing textual comments, reviews, and unstructured text is far more complicated than analysing quantitative ratings. Nowadays, with the help of Natural Language Processing and Machine Learning, it is possible to process enormous amounts of text effectively without human assistance. In this regard, Kongthon et al. 4 implemented an online tax system using natural language processing and artificial intelligence, using NLP to handle future scenarios. The majority of high-level natural language processing applications concern factors emulating thoughtful behaviour.
To use a very large target vocabulary without increasing training complexity, Jean et al. 5 propose a method based on importance sampling that allows the Neural Machine Translation (NMT) model to operate with a large-scale vocabulary. However, refining, producing, or approaching a practical method of NLP can be difficult. As a result, several researchers 6 have used Convolutional Neural Networks (CNN) for NLP, which outperform classical Machine Learning. However, the majority of current research focuses on learning dependency information from contextual words to aspect words based on the sentence's dependency tree, which does not take advantage of contextual affective knowledge with regard to the specific aspect. Liang et al. 7 propose a SenticNet-based graph convolutional network to leverage the affective dependencies of the sentence with respect to the specific aspect. Specifically, the authors build graph neural networks by integrating SenticNet's affective knowledge to enhance sentence dependency graphs. Strubell et al. 8 used large amounts of unlabelled data in their research work; it has been observed that NLP combined with a neural network model yields good accuracy, and the available computational resources determine the accuracy improvement. Based on extensive research, the authors also made some cost-cutting recommendations.
Similarly, data from the accounting, auditing, and finance domains are being analysed using NLP to gain insight and inference for knowledge creation; Fisher et al. 9 reviewed this domain and provided future paths. Apart from these, Vinyals et al. 10 developed a new strategy for solving the problem of variable-size output dictionaries. NLP-based techniques have been used in standardized dialogue-based systems such as chatbots 11. Text Analytics is also the area where NLP is most frequently applied 12. Machine learning algorithms with NLP can be used for further objectives like translating, summarizing, and extracting data, but at high computational cost.
Deep learning 13 has played an important role in predicting diseases such as COVID-19 and others 14,15 during the current pandemic; a detailed theoretical treatment is presented in the textbook 16. BERT 17 is one of these models; it employs a transformer, an attention mechanism that understands the meaning of ambiguous language in text by using the surrounding words (or sub-words) to establish context. The Stanford Question Answering Dataset (SQuAD), a dataset constructed expressly for this job, is one of BERT's fine-tuned tasks in the original BERT paper. SQuAD is made up of a variety of English-language texts; questions about the dataset's documents are answered by extracts from those documents. Many engineers have adapted the BERT model's original architecture since its first release to create their own versions.
GloVe 18 is an unsupervised learning algorithm that produces vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics taken from a corpus, and the resulting representations highlight intriguing linear substructures of the word vector space. ELMo 19 is an example of a deeply contextualized word representation that captures the intricate properties of word use (such as syntax and semantics) and the ways in which these uses vary across different linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), pre-trained on a substantial text corpus. They may be integrated into existing models and considerably advance the state of the art in a wide variety of complex natural language processing tasks, such as question answering, textual entailment, and sentiment analysis.
Determining the polarity of text is one of the significant tasks of NLP-based sentiment analysis. To determine polarity, researchers have employed unsupervised and reproducible sub-symbolic approaches such as auto-regressive language models, turning spoken language into a type of protolanguage 20. Polarity is a compelling idea for comprehending the grey area of sentiments. To further improve sentiment analysis, Trueman et al. 21 proposed a convolutional stacked bidirectional long short-term memory network with a multiplicative attention mechanism for detecting aspect categories and sentiment polarity. Affective computing and sentiment analysis, comprising human-computer interaction, machine learning, and multi-modal signal processing, have been proposed 22 for capturing the meaning of people's sentiments on social media platforms. The sentiments collected sometimes suffer from imbalanced or insufficient data. This problem is addressed by the meta-based self-training method with a meta-weighter (MSM) 23, which is based on neurosymbolic learning systems. An analysis has also been performed to check the bias of pre-trained learning models for sentiment analysis and emotion detection 24. Table 1 summarises several relevant articles and research papers on review analysis.

Methodology
The block diagram of the overall methodology used for sentiment detection in reviews is shown in Figure 1. Three major steps are taken to detect sentiment in reviews: 1. Data pre-processing, 2. Word embedding, and 3. Models employed.

Pre-processing of data. Data mining is essential in NLP, and data pre-processing is crucial in model construction. Pre-processing removes ambiguity and redundancy from the data. To implement machine learning and deep learning algorithms, NLP requires specific pre-processing of the text input, and various methods are used to convert textual data into a format suitable for modelling. Data pre-processing techniques are critical in designing an NLP model that focuses only on the important parts of the text. The following are the fundamental pre-processing techniques:

Punctuation removal. Commas and other punctuation marks may not be necessary for understanding a sentence's meaning, so they are removed.
Stop words removal. Stop words (words that connect other words and don't provide wider context) can be ignored and filtered from the text, as they are more common and carry less useful information. Examples include conjunctions such as 'and', 'or', and 'but'; prepositions such as 'in', 'of', 'to', and 'from'; and articles such as 'a', 'an', and 'the'.
Lemmatization. Lemmatization is the process of grouping together inflected word forms that derive from the same root word, so that they can be analysed as a single word.
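As a concrete illustration of the three steps above, the following sketch chains punctuation removal, stop-word removal, and lemmatization. The stop-word list and lemma dictionary are tiny stand-ins for real resources (e.g., NLTK's stop-word list and WordNetLemmatizer), not the exact tools used in this work.

```python
import string

# A tiny illustrative stop-word list; real pipelines use a fuller list (e.g., NLTK's).
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "in", "of", "to", "from", "is", "it", "are"}

# Minimal lemma lookup standing in for a real lemmatizer (hypothetical entries).
LEMMAS = {"batteries": "battery", "works": "work", "loved": "love"}

def preprocess(review: str) -> list[str]:
    """Lower-case, strip punctuation, drop stop words, and lemmatize."""
    # Punctuation removal
    review = review.translate(str.maketrans("", "", string.punctuation))
    tokens = review.lower().split()
    # Stop-word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Lemmatization (dictionary lookup as a stand-in)
    return [LEMMAS.get(t, t) for t in tokens]

print(preprocess("The batteries are great, and it works!"))
# → ['battery', 'great', 'work']
```

Applying the steps in this order (punctuation first, then tokenisation) avoids stray tokens such as "works!" slipping past the stop-word and lemma lookups.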
Word embedding. The pre-processed data is then used to create word vectors using different word embedding techniques, namely (i) Bidirectional Encoder Representations from Transformers (BERT), (ii) Embeddings from Language Models (ELMo), (iii) Global Vectors for Word Representation (GloVe), and (iv) FastText.
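To illustrate how pre-trained embeddings are wired into a downstream model, the sketch below parses vectors in the GloVe text format (one word followed by its floats per line) into an embedding matrix indexed by a toy vocabulary. The vocabulary, dimensionality, and vectors here are invented for the example, not taken from the paper.

```python
import numpy as np

# Toy data in the GloVe text-file format: "word v1 v2 ... vd" per line.
# Real experiments would read a downloaded file such as glove.6B.100d.txt.
glove_text = """good 0.1 0.8 0.3
bad -0.4 -0.7 0.2
battery 0.5 0.1 -0.2"""

def build_embedding_matrix(glove_lines: str, word_index: dict, dim: int) -> np.ndarray:
    """Map each vocabulary word to its pre-trained vector (zeros if absent)."""
    vectors = {}
    for line in glove_lines.splitlines():
        parts = line.split()
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    # Row 0 is reserved for padding, as is conventional for embedding layers.
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

word_index = {"good": 1, "battery": 2, "unknown": 3}  # hypothetical vocabulary
emb = build_embedding_matrix(glove_text, word_index, dim=3)
```

Out-of-vocabulary words ("unknown" above) fall back to a zero row, which a trainable embedding layer can later adjust.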

Bidirectional encoder representations from transformers (BERT).
BERT is an innovative model that applies bidirectional training of Transformers. Using the Transformer, BERT learns the contextual relations between words (or sub-words) in a given text. In its original form, the Transformer contains two components: an encoder that reads the text input and a decoder that produces the prediction. Since BERT's aim is to build a language model, only the encoder is necessary. Figure 2 is an illustration of the BERT representation.
Embedding from Language Model (ELMo). ELMo 31 is an abbreviation for 'Embeddings from Language Models', a method for representing a sequence of words as vectors. The concept of ELMo arises from the shortcomings of GloVe and other static pre-trained embedding models: compared to GloVe, ELMo produces a different, contextual embedding. ELMo vectors are used to improve the accuracy of classification in NLP tasks, since ELMo can fairly distinguish the meaning of the same word used in different sentences and contexts. The ELMo architecture is fairly broad, consisting of LSTM layers, and language model training is accomplished effectively with it. Its representation can be characterized as follows: • Contextual: the representation of each word depends on the entire context in which it is used.

• Character-based: ELMo's representations are built from characters, allowing the network to use sub-word clues to form a robust representation.
Global vectors for word representations (GloVe). GloVe 32 is a distributed word representation model whose name derives from Global Vectors. The GloVe model is an excellent tool for discovering associations between cities, countries, synonyms, and complementary products. SpaCy creates feature vectors and uses cosine similarity and Euclidean distance to match related and distant words. GloVe can also be used as a framework for word representation to detect psychological stress in online or offline interviews. GloVe is an unsupervised learning method for acquiring vector representations of words: it collects and aggregates global word-to-word co-occurrence statistics from the corpus for training, and yields linear substructures of the word vectors in a given space.

Convolutional neural network (CNN). The CNN model used is a five-layer sequential model. The architecture consists of an input layer of size equal to the sequence length. The second layer is the embedding layer, applied on top of the input layer, with an embedding dimension of 100. The subsequent layer is a 1D convolutional layer on top of the embedding layer with a filter size of 32 and a kernel size of 4, using the 'ReLU' activation function. After the 1D convolutional layer, a global max-pooling 1D layer is used for pooling. The pooling layer's output is fed into two dense layers: a penultimate layer with 24 neurons and a 'ReLU' activation function, and a final output layer with one neuron and a 'sigmoid' activation function. Finally, the model is compiled using the 'binary_crossentropy' loss function, the Adam optimizer, and the accuracy metric.
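A minimal Keras sketch consistent with the five-layer CNN description above. The vocabulary size and sequence length are placeholder values (the paper does not restate them here), so this is an illustrative reconstruction rather than the authors' code.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 5000   # placeholder vocabulary size
MAX_LEN = 100       # placeholder sequence length

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),            # input layer, size equal to length
    layers.Embedding(VOCAB_SIZE, 100),         # embedding layer, 100 dimensions
    layers.Conv1D(32, 4, activation="relu"),   # 1D convolution: 32 filters, kernel size 4
    layers.GlobalMaxPooling1D(),               # global max pooling
    layers.Dense(24, activation="relu"),       # penultimate dense layer
    layers.Dense(1, activation="sigmoid"),     # binary sentiment output
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```

The sigmoid output with binary cross-entropy matches the two-class (Positive/Negative) labelling used later in the paper.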
Bidirectional LSTM (BiLSTM). The LSTM model used is a four-layer sequential model. The architecture consists of an input layer with size equal to the sequence length. The input layer is routed through the second layer, an embedding layer with an embedding dimension of 100 over the vocabulary. The output of the second layer is routed through a 100-unit bidirectional LSTM layer. The output from the bidirectional layer is passed into two dense layers: the first with 24 neurons and a 'ReLU' activation function, and a final output layer with one neuron and a 'sigmoid' activation function. Finally, the model is compiled using the 'binary_crossentropy' loss function, the Adam optimizer, and the accuracy metric. After that, a Multi-channel CNN was used, which is quite similar to the previous model. Figure 3 is an illustration of the BiLSTM.
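The BiLSTM stack just described can be sketched analogously; again, the vocabulary size and sequence length are assumed placeholders, not the paper's exact settings.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 5000   # placeholder values
MAX_LEN = 100

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 100),       # embedding layer, 100 dimensions
    layers.Bidirectional(layers.LSTM(100)),  # 100-unit bidirectional LSTM
    layers.Dense(24, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```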
Multi-channel CNN. The model used in this paper consists of three channels, all sharing the same architecture. Channel one consists of input 1 with shape equal to the sequence length; the second layer is an embedding layer applied to the first, with the vocabulary size and an embedding dimension of 100, followed by a Conv1D layer with a filter size of 32, a kernel size of 4, and the 'ReLU' activation function. A dropout layer with a dropout rate of 0.5 is added on top of the Conv1D layer, followed by a max-pooling layer with a pooling size of 2, after which the result is flattened and stored in the flat 1 layer. Channels 2 and 3 apply the same sequence of layers with the same attribute values, and their results are flattened into the flat 2 and flat 3 layers, respectively. The outputs stored in flat 1, flat 2, and flat 3 are concatenated into a merged layer. The merged layer's output is passed through two dense layers: the first contains ten neurons with the 'ReLU' activation function, followed by another dense layer with one node and the 'sigmoid' activation function. Finally, the model is formed from input 1, input 2, and input 3 and the output of the last dense layer, and is compiled using the binary cross-entropy loss function, the Adam optimizer, and the accuracy metric. The architecture is shown in Figure 4.
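Because the three channels are identical, the architecture is naturally expressed with the Keras functional API and a small channel-builder helper. As before, the vocabulary size and sequence length are illustrative placeholders.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 5000   # placeholder
MAX_LEN = 100

def make_channel():
    """One channel: input → embedding → Conv1D → dropout → max-pool → flatten."""
    inp = layers.Input(shape=(MAX_LEN,))
    x = layers.Embedding(VOCAB_SIZE, 100)(inp)
    x = layers.Conv1D(32, 4, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    return inp, layers.Flatten()(x)

(in1, f1), (in2, f2), (in3, f3) = make_channel(), make_channel(), make_channel()
merged = layers.concatenate([f1, f2, f3])          # flat 1 + flat 2 + flat 3
x = layers.Dense(10, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(x)

model = models.Model(inputs=[in1, in2, in3], outputs=out)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```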

Random multi-model deep learning (RMDL). RMDL is a new deep learning technique for classification that can accept text, video, images, and symbols as input. RMDL includes random models, as shown in Fig. 5, having three components. The RMDL model used here is sequential with five layers. The architecture consists of an input layer with size equal to the sequence length. After the input layer, the second layer is the embedding layer with the vocabulary size and an embedding dimension of 100. The third layer is a 1D convolutional layer on top of the embedding layer with a filter size of 128 and a kernel size of 5, using the 'ReLU' activation function. The fourth layer is a bidirectional LSTM with 32 units. The output from the bidirectional layer is passed into two dense layers: the first with 24 neurons and a 'ReLU' activation function, and a final output layer with one neuron and a 'sigmoid' activation function. Finally, the model is compiled using the 'binary_crossentropy' loss function, the Adam optimizer, and the accuracy metric.

Experimental analysis and results
This section describes the dataset, the experimental setup, and the experimental results.

Dataset description. The dataset used in this work is an Amazon product review dataset obtained from Kaggle. The dataset contains the following entities as columns:
• Id: unique id of the product (34,660)
• Name: name of the product
• Brand: brand of the product, e.g., Amazon
• Categories: category of the product, e.g., Electronics
• Review Text: reviews given by customers about the product
• Rating: customer feedback on the product (range from 1 to 5)
There are 34,660 samples in this dataset. First, useful features are extracted, and features with many null values are removed from the table because they play no role in prediction. The final dataset has only two columns: review text and rating. The ratings are labelled as either Negative (0) or Positive (1): ratings greater than or equal to 3 are considered positive, while ratings less than 3 are considered negative. Table 2 gives the details of the experimental setup used for the simulations in the proposed work.
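The labelling rule above (rating ≥ 3 → Positive, otherwise Negative) can be sketched in a few lines; the sample records are invented for illustration, and the field names merely follow the column descriptions.

```python
# Map 1-5 star ratings to binary sentiment labels: >= 3 → Positive (1), < 3 → Negative (0).
def label_rating(rating: int) -> int:
    return 1 if rating >= 3 else 0

reviews = [
    {"review_text": "Works great", "rating": 5},
    {"review_text": "Stopped charging", "rating": 2},
    {"review_text": "Average battery life", "rating": 3},
]
labels = [label_rating(r["rating"]) for r in reviews]
print(labels)  # [1, 0, 1]
```

Note that the boundary rating of 3 is folded into the positive class, which is one reason the label distribution is skewed toward positives.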

Experimental setup.
Results and discussion. The pre-processed data is split into a 75% training set and a 25% testing set. To find the training accuracy, trainX was used as the training sample input with the training labels (Positive, Negative) as the prediction targets, and verbose was kept at 0; a training accuracy of 98.83% was achieved. To find the testing accuracy, testX was used as the testing sample input with the validation labels (Positive, Negative) as the prediction targets, and verbose was kept at 0; a testing accuracy of 72.46% was achieved. Figure 8a shows the resulting confusion matrix, and Figure 12c shows the confusion matrix formed by the FastText plus Multi-channel CNN model. Table 3 shows the classification report against y_test and the predictions, with the target classes labelled 0 and 1; the F1-score, which is the harmonic mean of precision and recall, has a value of 74%. Figure 13a shows the model accuracy when the FastText plus RMDL model is applied, with the blue line representing training accuracy and the red line validation accuracy; Figure 13b shows the corresponding model loss, with the blue line representing training loss and the red line validation loss. Out of 27,727 samples, 17,883 are true positives and 3037 are samples incorrectly predicted as positive, while 5620 are true negatives and 1187 are false negatives.
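For reference, precision, recall, and F1 can be derived directly from confusion-matrix counts, as in the sketch below; the counts used here are toy values for illustration, not the figures reported above.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy counts for illustration only.
p, r, f1 = prf1(tp=80, fp=20, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.8 0.8
```

A classification report such as Table 3 typically averages these per-class scores (macro or weighted), which is why the reported F1 can differ from the positive-class F1 alone.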
As is well known, a sentence is made up of various parts of speech (POS), and each combination yields a different accuracy rate. The validation accuracy of the various models and text classifiers is shown in Table 4. Among all models, the Multi-channel CNN with the FastText embedding gives around 80% validation accuracy, followed by the LSTM (BERT), RMDL (BERT), and RMDL (ELMo) models at 78%. Table 4 shows the overall results of all the models used, including accuracy, loss, validation accuracy, and validation loss.

Neutrality in classification.
Neutrality is addressed in various ways depending on the approach employed.
In lexicon-based approaches 34, the word neutrality score is used either to identify neutral opinions or to filter them out so that algorithms can focus mainly on positive and negative sentiments. When statistical methods are used, however, the treatment of neutrals changes dramatically. Some researchers 35 filter out the more numerous objective (neutral) phrases in a text and evaluate only subjective assertions, prioritising them for better binary categorization. There is a widespread belief that neutral texts provide less guidance than those making overtly positive or negative statements. As a result, in academic articles on sentiment analysis that employ statistical methodologies, researchers generally prefer to ignore the neutral category, assuming that neutral texts lie near the boundary of the binary classifier.
In this article, we did not consider neutrality.

Conclusion
This article explored customer review analysis using the Amazon dataset and tested four well-known supervised classifiers. Critical grammatical sections were also evaluated and investigated. It has been established that, of all the potential combinations of the various parts of speech, the most effective combination consists of a verb, an adverb, and an adjective. Evaluating the quality of online items relies on the positive or negative classification of remarks. As a sentence consists of a variety of distinct parts of speech, the various combinations yield a spectrum of differing degrees of accuracy. Table 4 compares the text classifiers and presents the validation accuracy of the various models. Among all of the models, the Multi-channel CNN model with the FastText embedding offers about an 80% validation accuracy rate, followed by the LSTM (BERT), RMDL (BERT), and RMDL (ELMo) models at 78%. Future work will aim to develop a fair and effective technique that also integrates the neutrality of reviews to enhance the analysis.

Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.