Decoding violence against women: analysing harassment in middle eastern literature with machine learning and sentiment analysis

Low, Hui Qi; Keikhosrokiani, Pantea; Pourya Asl, Moussa

doi:10.1057/s41599-024-02908-7

Article
Open access
Published: 10 April 2024

Humanities and Social Sciences Communications volume 11, Article number: 497 (2024)

Abstract

The rising prevalence of harassment in Middle Eastern countries is mirrored in literary works from the region. However, extracting data from these texts to understand the typology and frequency of the cases poses a significant challenge due to human cognitive limitations and potential biases. Thus, this study aims to use natural language processing (NLP) approaches to propose a machine learning framework for text mining of sexual harassment content in literary texts. The data source for this study consists of twelve Middle Eastern novels. The proposed framework involves the classification of physical and non-physical types of sexual harassment using a machine-learning model. Lexicon-based sentiment and emotion detection are applied to sentences containing instances of sexual harassment for data labelling and analysis. Finally, a long short-term memory-gated recurrent unit (LSTM-GRU) deep learning model is built to classify the sentiment characteristics that induce sexual harassment. The proposed model achieved an accuracy of 75.8% while outperforming five other algorithms. Additionally, a sentiment classification with three labels—negative, positive, and neutral—was developed using an LSTM-GRU RNN deep learning model. The accuracy of this model was 84.5%. Most statements, even those involving physical sexual harassment, which had greater levels of sexual harassment, had negative sentiments, according to lexicon-based sentiment analysis. This study contributes to the field of text mining by providing a novel approach to identifying instances of sexual harassment in literature in English from the Middle East. The use of machine learning models and sentiment analysis techniques allows for more accurate identification and classification of different types of sexual harassment. Furthermore, this study sheds light on the prevalence of sexual harassment in Middle Eastern countries and highlights the need for further research and action to address this issue.

Introduction

The escalating prevalence of sexual harassment cases in Middle Eastern countries has emerged as a pressing concern for governments, policymakers, and human rights activists. In recent years, scholars have made significant strides in advancing our understanding of the typology and frequency of these cases through both empirical and theoretical contributions (Eltahawy, 2015; Ranganathan et al., 2021). Moreover, researchers have sought to supplement their findings by examining evidence from alternative sources such as literary texts and life writings. While literary representations from the region offer valuable insights into individual and collective experiences of sexual harassment, the analysis of these texts to extract relevant data presents considerable challenges due to inherent limitations in human cognitive processes and potential biases (Keikhosrokiani and Pourya Asl, 2023). Consequently, the task of extracting specific content from extensive texts like novels is arduous and time-consuming. The scholarly community has made substantial progress in comprehending the multifaceted nature of sexual harassment cases in the Middle East (Karami et al., 2021). Researchers have conducted rigorous empirical studies that shed light on various aspects of this issue, including its prevalence rates, underlying causes, and societal implications (Bouhlila, 2019). These studies have not only provided valuable statistical data but have also generated theoretical frameworks that enhance our understanding of the complex dynamics at play. In addition to empirical research, scholars have recognized the importance of exploring alternative sources to gain a more comprehensive understanding of sexual harassment in the region. Literary texts and life writings offer unique perspectives on individual experiences and collective narratives related to this issue (Asl, 2023). However, analysing these sources poses significant challenges due to limitations in human cognitive processes. 
Extracting specific content from large-scale literary works requires meticulous attention to detail and an extensive amount of time. Researchers must carefully navigate through vast amounts of text to identify relevant passages that provide insights into sexual harassment experiences (Ennaji and Sadiqi, 2011). This process is further complicated by potential biases that may influence researchers’ interpretations or choices of which passages to include or exclude.

A hybrid computational method that combines interpretative social analysis and computational techniques has emerged as a powerful approach in digital social research. This method enables the establishment of statistical strategies and facilitates quick prediction, particularly when dealing with large and complex datasets (Lindgren, 2020). To conduct a comprehensive study of social situations, it is crucial to consider the interplay between individuals and their environment. In this regard, emotional experience can serve as a valuable unit of measurement (Lvova et al., 2018). One of the main challenges in traditional manual text analysis is the inconsistency in interpretations resulting from the abundance of information and individual emotional and cognitive biases. Human misinterpretation and subjective interpretation often lead to errors in data analysis (Keikhosrokiani and Asl, 2022; Keikhosrokiani and Pourya Asl, 2023; Ying et al., 2022). To address this issue, hybrid methods that combine manual annotation with computational strategies have been proposed to ensure accurate interpretations are made. However, it is important to acknowledge that computational methods have limitations due to the inherent variability of sociality. Sociality can vary across different dimensions, such as social interaction, social patterns, and social activities within different data ages. Consequently, there are no “general rules” or a universally applicable framework for analysing societies or defining a “general world” (Lindgren, 2020). In this context, text mining emerges as an invaluable tool for efficiently analysing large volumes of data. Its ability to quickly identify patterns and trends related to various phenomena makes it particularly well-suited for investigating issues such as sexual harassment.

Using natural language processing (NLP) approaches, this study proposes a machine learning framework for text mining of sexual harassment content in literary texts. The data source for this study consists of twelve Middle Eastern novels written in English. The proposed framework involves the classification of physical and non-physical types of sexual harassment using a machine-learning model. Additionally, lexicon-based sentiment and emotion detection are applied to sentences containing instances of sexual harassment for data labelling and analysis. Lexicon-based sentiment analysis involves analysing text for positive or negative sentiment using pre-defined lexicons or dictionaries. Emotion analysis involves identifying emotions expressed within text, such as anger or sadness. Finally, an LSTM-GRU deep learning model is built to classify the sentiment characteristics that induce sexual harassment. The neural network approach involves training a model using large datasets to recognize patterns and make predictions based on new data inputs.
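The lexicon-based step described above can be illustrated with a minimal sketch. It assumes a toy lexicon whose words and polarity weights are invented for illustration; the function names `score_sentence` and `label_sentence` are likewise hypothetical and are not taken from the study, which does not specify its implementation at this point.

```python
import re

# Toy lexicon: hypothetical words and polarity weights, for illustration only.
TOY_LEXICON = {
    "afraid": -1.0,
    "hurt": -1.0,
    "threatened": -1.5,
    "ashamed": -1.0,
    "safe": 1.0,
    "kind": 1.0,
    "relieved": 1.0,
}


def score_sentence(sentence, lexicon=TOY_LEXICON):
    """Sum the polarity weights of the lexicon words found in the sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum(lexicon.get(token, 0.0) for token in tokens)


def label_sentence(sentence, lexicon=TOY_LEXICON, threshold=0.0):
    """Map the summed score to one of three labels."""
    score = score_sentence(sentence, lexicon)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"
```

A sentence with no lexicon words scores zero and is labelled neutral, which is one reason lexicon coverage matters for this kind of labelling.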

The use of machine learning approaches can help to identify patterns within large datasets that may not be immediately apparent through manual analysis. This approach can also help reduce bias by removing human subjectivity from the process of analysis. The use of machine learning models and sentiment analysis techniques allows for more accurate identification and classification of different types of sexual harassment than traditional methods such as manual coding or human annotation. Lexicon-based sentiment and emotion analysis allows for more nuanced analysis by taking into account the emotional context surrounding instances of sexual harassment. Finally, an LSTM-GRU deep learning model allows for a deeper understanding of the underlying factors that contribute to sexual harassment, which can inform future prevention and intervention efforts. The use of lexicon-based sentiment and emotion analysis, as well as a neural network, can help identify patterns and reduce bias in the analysis process. Overall, this study contributes to the field of text mining by providing a novel approach to identifying instances of sexual harassment in literary works from the Middle East. Furthermore, this study sheds light on the prevalence of sexual harassment in Middle Eastern countries, highlighting the need for continued efforts to address this issue.

Background

Sexual harassment types

Sexual harassment is a pervasive issue that can be categorized into three distinct forms: gender harassment, unwanted sexual attention, and sexual coercion. Each category represents a different manifestation of these harmful behaviours, highlighting the various ways in which individuals are subjected to harassment in their personal and professional lives. Gender harassment is a form of discrimination that aims to hinder women from attaining positions of power in traditionally male-dominated fields. It encompasses both verbal and non-verbal conduct that seeks to belittle, demean, or exclude women based on their gender (del Carmen Herrera et al., 2017). This type of harassment perpetuates gender inequality by creating hostile environments that discourage women from fully participating and advancing in their chosen careers. Unwanted sexual attention involves deliberate contact and repetitive requests for data that are intended to attract or express offensive sexual attraction (del Carmen Herrera et al., 2017). This form of harassment often includes unwelcome advances, explicit comments, or inappropriate gestures. It is characterized by the harasser’s persistent pursuit of sexual gratification at the expense of the victim’s comfort and autonomy. Such behaviour not only violates personal boundaries but also creates an intimidating atmosphere that undermines the victim’s sense of safety and well-being. Lastly, sexual coercion occurs when a harasser abuses their position of power to demand sexual favours from a victim in exchange for benefits within a Quid Pro Quo environment (Fateh, 2022). This insidious form of harassment involves leveraging promises of rewards or threats of punishment to manipulate the victim into complying with the harasser’s demands. The power dynamics at play exacerbate the vulnerability of the victim, as they may fear negative consequences if they refuse or report the harassment. 
It is crucial to recognize that these three categories are not mutually exclusive; they often intersect and coexist within instances of sexual harassment. For instance, gender harassment may lay the foundation for unwanted sexual attention or coercion by perpetuating an environment where such behaviour is normalized or tolerated. Understanding the nuances and complexities of these categories is essential in addressing and combating sexual harassment effectively.

Sexual harassment in the Middle East

Sexual harassment is a pervasive and serious problem that affects the lives and well-being of many women and men in the Middle East. According to a UN Women survey, online harassment was the most common type of violence against women in nine countries in the region during the pandemic (Ranganathan et al., 2021). However, sexual harassment is not limited to the online sphere but also occurs in various forms, including gender harassment, unwanted sexual attention, and sexual coercion in different settings such as workplaces, educational institutions, public places, and homes. Throughout the region, gender harassment often manifests through verbal abuse, derogatory comments, or discriminatory behaviour towards women (Asl, 2023; Hadi and Asl, 2022). Previous studies highlight how patriarchal norms and traditional gender roles contribute to gender harassment in this region. In particular, the cultural emphasis on modesty and honour perpetuates gender harassment by placing blame on women for their attire or behaviour. The concept of “honour” has become a tool for controlling women’s actions and justifying harassment (Asl, 2022, 2020; Asl and Hanafiah, 2023; Chew and Asl, 2023; Yan and Asl, 2023). Gender harassment is perpetrated to reinforce power imbalances between men and women in Middle Eastern societies. Men often exert dominance over women through verbal abuse or by limiting their access to public spaces (Wei and Asl, 2023). Numerous studies have shown that gender harassment perpetuates a culture of silence where victims are discouraged from speaking out due to fear of social stigma or retaliation, hindering progress towards gender equality and reinforcing harmful stereotypes about women’s roles in Middle Eastern societies (Asl, 2019).

Both unwanted sexual attention and sexual coercion are also influenced by cultural norms surrounding modesty and sexuality. Modesty is highly valued in many Middle Eastern cultures to preserve honour and maintain social order (Ennaji and Sadiqi, 2011). Unwanted sexual attention is often seen as a violation of these cultural norms, leading to victim-blaming and shaming (Eltahawy, 2015). It is argued that the prevalence of unwanted sexual attention perpetuates a culture of fear and insecurity for women in the Middle East. It restricts their freedom of movement and limits their opportunities for education and employment, hindering their overall empowerment (Bouhlila, 2019). In cases of sexual coercion, victims often face immense pressure to remain silent due to fears that their reputation or family’s honour will be tarnished, which perpetuates a cycle of violence and oppression within Middle Eastern societies. Victims often find themselves trapped in abusive relationships without access to legal protection or support systems, leading to long-term psychological trauma. Both types of sexual harassment are often justified or normalized by the harassers as a way of expressing their masculinity and asserting their dominance.

Text classification techniques

Sexual harassment can be investigated using computational literary studies, which disclose activities and patterns from large textual data. Computational literary studies, a subfield of digital literary studies, utilizes computer science approaches and extensive databases to analyse and interpret literary texts. Through the application of quantitative methods and computational power, these studies aim to uncover insights regarding the structure, trends, and patterns within the literature. Computational literary studies encompass various disciplines such as computational linguistics, statistical methodology, natural language processing, machine learning, and text mining (Da, 2019; Elmi et al., 2023; Mohd Amram et al., 2023; Zhao and Keikhosrokiani, 2022). The field of digital humanities offers diverse and substantial perspectives on social situations. While it is important to note that predictions made in this field may not be applicable to the entire world, they hold significance for specific research objects. For example, in computational linguistics research, the lexicons used in emotion analysis are closely linked to relevant concepts and provide accurate results for interpreting context. However, it is important to acknowledge that embedded dictionaries and biases may introduce exceptions that cannot be completely avoided. Nonetheless, computational literary studies offer advantages such as quick interpretation, analysis, and prediction on extensive datasets (Kim and Klinger, 2018).

Natural language processing (NLP) techniques have been widely adopted for text classification, which assigns labels to sentences, paragraphs, or documents (Abadah et al., 2023; Asri et al., 2022; Chu et al., 2022; Fasha et al., 2022; Jafery et al., 2023; John and Keikhosrokiani, 2022; Al Mamun et al., 2022). Text classification has been applied in a variety of fields, such as health, social science, business marketing and law. In the social sciences in particular, it has been used over the past decades to uncover human activity, with researchers analysing text data such as chat messages, notes, and social media posts. Text classification can be applied at four levels of text size: document, paragraph, sentence and sub-sentence. The process involves four phases: feature extraction, dimension reduction, classifier selection and evaluation. First, unstructured text data is converted into structured data and cleaned so that only the important characters and words are retained for feature extraction. Second, dimensionality reduction may optionally be applied to reduce time and memory complexity if the pre-processed data is large. Third, a machine learning model, a deep learning model, or an ensemble-based learner is selected and trained for text classification. Finally, the trained classification model is evaluated to understand its performance (Kowsari et al., 2019).
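The four phases can be sketched end to end in a few lines. The sketch below is an illustration under stated assumptions rather than the pipeline used in this study: the toy sentences and labels are invented, the function names are hypothetical, the dimension reduction is a simple frequency cut-off, and the classifier is a stand-in that compares each sentence with a summed term profile per label using cosine similarity.

```python
import math
import re
from collections import Counter

# Toy labelled sentences, invented for illustration (not the study's data).
TRAIN = [
    ("He grabbed her arm.", "physical"),
    ("He grabbed her arm again.", "physical"),
    ("He shouted insults at her.", "non-physical"),
    ("He shouted insults in the street.", "non-physical"),
]
TEST = [
    ("He grabbed her arm tightly.", "physical"),
    ("He shouted insults at them.", "non-physical"),
]


def extract_features(text):
    """Phase 1: convert raw text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def select_terms(vectors, max_terms):
    """Phase 2 (optional): choose the max_terms most frequent terms."""
    totals = Counter()
    for vector in vectors:
        totals.update(vector)
    return {term for term, _ in totals.most_common(max_terms)}


def project(vector, kept_terms):
    """Drop every term that was not selected in phase 2."""
    return Counter({t: c for t, c in vector.items() if t in kept_terms})


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(vectors, labels):
    """Phase 3: build one summed term profile per label (stand-in classifier)."""
    profiles = {}
    for vector, label in zip(vectors, labels):
        profiles.setdefault(label, Counter()).update(vector)
    return profiles


def predict(profiles, vector):
    """Return the label whose profile is most similar to the vector."""
    return max(sorted(profiles), key=lambda label: cosine(profiles[label], vector))


def evaluate(profiles, vectors, labels):
    """Phase 4: accuracy, the fraction of correctly predicted labels."""
    hits = sum(predict(profiles, v) == y for v, y in zip(vectors, labels))
    return hits / len(labels)


def run_pipeline(train_set, test_set, max_terms):
    """Chain the four phases and return accuracy on the test set."""
    train_vectors = [extract_features(text) for text, _ in train_set]
    kept = select_terms(train_vectors, max_terms)
    profiles = train([project(v, kept) for v in train_vectors],
                     [label for _, label in train_set])
    test_vectors = [project(extract_features(text), kept) for text, _ in test_set]
    return evaluate(profiles, test_vectors, [label for _, label in test_set])
```

Calling `run_pipeline(TRAIN, TEST, max_terms=6)` runs all four phases on the toy data. The vocabulary selected on the training set is reused for the test set, so that evaluation never sees terms the classifier was not trained on.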

Machine learning-based text classification

Several machine learning classification techniques are introduced below and summarised in Table 1. Rocchio classification builds a vector from word frequencies and compares its similarity to a predefined prototype vector. It does not generalise well because it is limited to retrieving a few relevant documents. Boosting and bagging are voting-based ensemble techniques used in text classification. Boosting trains models sequentially, changing the weight of each data point according to the performance of the previous model. Bagging draws sub-samples from the training set, trains a separate model on each, and predicts the label that receives the most votes among the trained models. The limitations of boosting and bagging are their computational expense and lack of interpretability. Logistic regression is a statistical model that uses a decision boundary to predict the probability of labels. The data points must be independent for it to predict well. Naïve Bayes classification is popular in document categorization and information retrieval. The model uses the frequency of words in a document and applies Bayes' theorem to predict the probability of each label. Its limitation is a strong assumption about the distribution of the data. The k-nearest neighbours (KNN) algorithm predicts the class of a test document from its similarity to the k nearest training documents. KNN requires large memory to store the data points and depends on the variety of the training data. The support vector machine (SVM) builds a feature map from word frequencies and finds a hyperplane that forms the boundary between the classes. The SVM model has high time complexity and high memory usage. The decision tree is a statistical model that splits the data points based on the entropy of nodes to form a hierarchical decomposition of the data space.
Decision trees are sensitive to small perturbations in the training data. Random forest is an ensemble learning method that builds multiple random decision trees in parallel and predicts the label voted for by most trees. Random forest requires more training time than other machine learning techniques. The conditional random field (CRF) is an undirected graphical model that performs well on text and high-dimensional data. A CRF models an observation sequence using conditional probabilities. It is computationally complex to train because of the high data dimensionality, and the trained model cannot work with unseen data. Semi-supervised learning is a variant of supervised learning used when a small portion of the data is labelled and a large portion is unlabelled. Clustering techniques are used to find whether there is more than one labelled cluster or to handle data in labelled and unlabelled clusters (Kowsari et al., 2019).
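Of the classifiers surveyed above, Naïve Bayes is compact enough to sketch in full. The following is a minimal multinomial variant, assuming add-one (Laplace) smoothing, which the text does not mention, and a hypothetical class name `NaiveBayesTextClassifier`. It is meant only to show how word frequencies and Bayes' theorem combine, not to reproduce any model evaluated in this study.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one smoothing (an assumption here)."""

    def fit(self, texts, labels):
        # Document counts per label give the class priors.
        self.class_counts = Counter(labels)
        # Word counts per label give the class-conditional likelihoods.
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_posteriors(self, text):
        """Return the unnormalised log posterior of each label for the text."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(doc_count / total_docs)
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # words never seen in training are ignored
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        """Return the label with the highest log posterior."""
        scores = self.log_posteriors(text)
        return max(sorted(scores), key=scores.get)
```

Each word contributes independently to the score. Treating words as independent given the label is the kind of strong distributional assumption noted as a limitation of the model.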

Table 1 Idea and limitation of machine learning-based text classification.
