A framework for automated detection of offensive messages in social networks in Kiswahili

Authors

Barongo, Everyjustus

Publisher

The University of Dodoma

Description

Dissertation (MSc Computer Science)
The diffusion of information generated on social network sites is the result of more people being connected. Connected people chat and comment by posting content such as images, videos, and messages. Social networks have been, and remain, useful to communities in that they bring relatives together, especially for sharing experiences and feelings. Although social networks have been beneficial to users, some of the shared messages and comments contain sexual and political harassment. This is particularly true in Kiswahili-speaking countries such as Tanzania. On most, if not all, Kiswahili social network sites, offensive messages have been and are publicly posted. These messages harass, embarrass, and even assault users, and to some extent lead to psychological effects. This study proposes a framework for automating the detection of offensive messages on social networks in Kiswahili settings by applying selected machine learning techniques. Specifically, the study created a Kiswahili dataset containing sexually and politically offensive messages and normal messages. All of these messages were collected from Facebook, YouTube, and JamiiForum, and they were used to evaluate the performance of the selected text classification algorithms. The collected messages were preprocessed using the Bag-of-Words (BoW) model, Term Frequency-Inverse Document Frequency (TF-IDF), and N-gram techniques to generate feature vectors. The experimental findings using the generated feature vectors showed that the Random Forest classifier was capable of assigning a message to the correct class label with an accuracy of 95.0259%, an F1-measure of 0.950 (95.0%), and a false positive rate of 2.8% when applied to the three-category dataset. On the other hand, the linear SVM (SVM-Linear) showed better results when applied to the two-category data. The study suggests the REST API based framework with random forest classifier and Kiswahili dataset to be deployed in real social net
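The feature-generation step named in the description (word N-grams weighted by TF-IDF) can be illustrated with a minimal, standard-library Python sketch. This is not the dissertation's implementation: the function names `word_ngrams` and `tfidf_vectors`, the whitespace tokenization, the unsmoothed IDF formula, and the placeholder messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from a lowercased message."""
    tokens = text.lower().split()  # assumption: plain whitespace tokenization
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_vectors(messages, n_max=2):
    """Build a sorted vocabulary and one TF-IDF vector per message.

    TF is the n-gram count divided by the number of n-grams in the message;
    IDF is log(N / df), an unsmoothed textbook variant (an assumption here).
    """
    docs = [Counter(word_ngrams(m, n_max)) for m in messages]
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of messages containing each n-gram.
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append([
            (doc[t] / total) * math.log(n_docs / df[t]) if total else 0.0
            for t in vocab
        ])
    return vocab, vectors


# Hypothetical, harmless placeholder messages (not from the study's dataset).
messages = ["habari za asubuhi", "habari za jioni", "asante sana rafiki"]
vocab, vectors = tfidf_vectors(messages)
```

Each row of `vectors` could then be passed, together with its class label, to a classifier such as a Random Forest or a linear SVM; that training step is omitted here.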

Keywords

Social media, Kiswahili, Jamii Forum, Social networks, Offensive messages, Sexual offensive messages, Political offensive messages, Sexual harassment, Political harassment, Harassment