Multi-label Classification of Indonesian Hate Speech on Twitter Using Support Vector Machines

Karimah Mutisari Hana, School of Computing, Telkom University, Bandung, Indonesia, [email protected]
Adiwijaya, School of Computing, Telkom University, Bandung, Indonesia, [email protected]
Said Al Faraby, School of Computing, Telkom University, Bandung, Indonesia, [email protected]
Arif Bramantoro, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia, [email protected]



Abstract—Hate speech has become a hot issue as it spreads massively on today's social media with specific targets, categories, and levels. In addition, hate speech can cause social conflict and even genocide. This research proposes a system that classifies hate speech written in the Indonesian language on Twitter. It also handles the noisiness of Twitter data, such as mixed languages and non-standard text. We not only use Support Vector Machines (SVM) as the classifier but also compare it with deep learning methods, namely CNN and CNN with DistilBERT. Apart from standard text preprocessing, we examine the effect of translation as a way of handling multilingual content. The data transformation methods used with the SVM model are Label Power-set (LP) and Classifier Chains (CC). The experimental results show that classification using SVM and CC without stemming, stopword removal, and translation provides the best accuracy of 74.88%. The best SVM hyperparameters for multilabel classification are the sigmoid kernel, a regularization parameter of 10, and a gamma of 0.1. Stemming, stopword removal, and translation preprocessing are less effective in this research. Moreover, CNN has a flaw in predicting labels with a low occurrence rate in the training data.
Keywords—classification, hate speech, social media, support vector machine



I. INTRODUCTION
The number of active social media users in Indonesia is growing rapidly1,2. One of the social media platforms actively used by Indonesian people is Twitter, a platform for expressing opinions freely. However, social media is often abused by directing hate speech at others. In Indonesian terminology, hate speech is defined as an act that offends certain individuals or groups based on ethnicity, religion, race, or intergroup relations [1]. Hate speech is usually accompanied by abusive words that can trigger negative emotions and thus accelerate the growth of social conflict and even genocide [1], [2]. However, abusive words are not always used in hate speech; they are sometimes used as a joke, even though such use can cause misunderstandings [2]. Therefore, a detection system for hate speech and abusive words in Indonesian tweets is needed. The detection of hate speech has become a hot issue because hate speech spreads massively on today's social media. Sentiment analysis can be used to detect hate speech because it can analyze the subjective information of specific objects such as sentences [3]. Several studies related to hate speech in Indonesian social media have been conducted [2], [5]-[7].

1 Simon Kemp, "Digital 2018: Indonesia", DataReportal, February 1, 2018, https://datareportal.com/reports/digital-2018-indonesia.
2 Simon Kemp, "Digital 2019: Indonesia", DataReportal, January 31, 2019, https://datareportal.com/reports/digital-2019-indonesia.






These studies used Twitter as the source of the dataset and machine learning as the proposed approach. Machine learning can process huge resources with good performance [4]. In [5], the performance of hate speech detection was compared for different combinations of features and machine learning algorithms on tweets related to the 2017 Jakarta Governor Election. In [6], hate speech detection was performed using an ensemble of methods: Naïve Bayes (NB), K-Nearest Neighbors (KNN), Maximum Entropy (ME), Random Forest (RF), and Support Vector Machine (SVM). These studies classify Indonesian tweets into two labels, namely hate speech and non-hate speech. Hate speech has a specific target, category, and level; classifying these facets is important to help the authorities follow up on hate speech issues in Indonesia. Research that detects hate speech while accommodating the various targets, categories, and levels of hate speech in Indonesian tweets is found in [2]. In that research, a dataset was built for abusive words and hate speech, including targets, categories, and speech levels. In addition, several test scenarios were performed to compare the performance of various classification methods. Multilingual communication often occurs on Indonesian social media, and this is one of the challenges in detecting hate speech there. Research on identifying multilingual hate speech has been conducted in [7], using tweet data from several languages, such as Indonesian, English, and Hindi. Different from [7], this study aims to identify abusive words and hate speech, including targets, categories, and speech levels, in Indonesian-language tweets. In addition, this research handles bilingual tweets that mix Indonesian and English. The hate speech classification conducted in this study is a common natural language text classification task, in which the label of a text is determined only by its content without depending on external knowledge (e.g. situation and condition) [8]. The main contributions of this work are as follows:
• Enriched the dataset by adding 5227 tweets
• Enriched the normalization dictionary by adding 2154 normalized words
• Translated English words into Indonesian during the preprocessing stage and analyzed the effect of the translation on the classification results
• Compared the classification results of machine learning and deep learning methods



II. RELATED WORK
Several studies related to hate speech in Indonesian social media have been conducted [2], [5]-[7]. They used a machine learning approach and Twitter as the source of the dataset. In [5], the performance was compared for different combinations of features and machine learning algorithms for the detection of hate speech in tweets related to the 2017 Jakarta Governor Election. The dataset in that study was labeled as hate speech or non-hate speech by 30 people with various gender, religious, and racial/ethnic backgrounds to reduce bias in the data. Both stand-alone and combined classification methods have been implemented in [6] to classify hate speech. The proposed classification methods are Naïve Bayes (NB), Maximum Entropy (ME), Random Forest (RF), and Support Vector Machine (SVM). The research shows that aggregating methods can improve classification performance: the stand-alone NB method and the combined NB, SVM, and RF method provide the best results for unbalanced data, while the stand-alone ME method provides the best results for balanced data. Hate speech is typically accompanied by the use of abusive words, but abusive words are not always used for hate speech. Therefore, multilabel classification of hate speech is proposed in [2]. Tweets are classified into not hate speech, not hate speech but abusive, hate speech but not abusive, and abusive hate speech. In addition, the classifier identifies tweets based on targets, categories, and hate speech levels. Several types of feature extraction, data transformation, and classification methods are used in that research. The best result is obtained using the word unigram feature, Label Power-set (LP), and Random Forest Decision Tree (RFDT), with an accuracy of 66.12%. Research on identifying multilingual hate speech has been carried out in [7], using tweet data from several languages, such as Indonesian, English, and Hindi. In [7], both monolingual and multilingual hate speech are classified. The monolingual classification process used three classification methods, namely SVM, NB, and RFDT. The multilingual classification process used three approaches: non-translated, translated without language identification, and translated with language identification. The best results for the classification of monolingual hate speech are obtained using SVM, while the best results for the classification of multilingual hate speech are obtained by the non-translated approach. However, the non-translated approach still needs careful consideration because it requires a lot of data. In this study, we propose a focused classification using SVM and compare it to CNN with and without the pre-trained DistilBERT model. The data transformation methods that we combine with the SVM model are Classifier Chains (CC) and Label Power-set (LP). We use TF-IDF as the feature extraction for the SVM classification. For the preprocessing stage, we compare several scenarios, such as with and without translation, full preprocessing, without stopword removal, and without stemming and stopword removal.

III. METHODOLOGY
The system in this study classifies tweets into 12 multilabel data classes. The classification system consists of four stages: 1) preprocessing; 2) feature extraction; 3) classification; and 4) evaluation.



A. Dataset
The Twitter dataset used in this research is the dataset from [2], consisting of 13169 tweets3. In addition, we collected 5227 tweets using Twint4. Each collected tweet was manually labeled by three annotators, and the final label of each tweet was obtained by majority voting on each label. In total, the dataset contains 18396 tweets: 12334 tweets in Indonesian, 3780 tweets in English, 1991 tweets mixing Indonesian and English, and 291 tweets in other languages. The labels used in this study are the same as the labels used in [2]. The label selection in [2] was obtained from a Focus Group Discussion (FGD) with police staff of Direktorat Tindak Pidana Siber Badan Reserse Kriminal Kepolisian Negara Republik Indonesia (Bareskrim Polri), the authority in handling cybercrime. The labels used in this study are as follows; all examples given below are translated from Indonesian to English.
1) Hate Speech (HS): This label is used for a tweet that offends certain individuals or groups. If a tweet has this label, it must also have labels that indicate the target, category, and level of the hate speech. An example of a tweet with this label is "Step down Jokowi because he is not only violating the laws, but also incompetent in managing the country".
2) Abusive: This label is used for a tweet that contains abusive words. An example of a tweet with this label is "You cannot even talking right, you carcass".
3) Individual Hate Speech (HS_Individual): This label is used for hate speech aimed at an individual. An example of a tweet with this label is "The worst president in the history of tadpole mankind".
4) Group Hate Speech (HS_Group): This label is used for hate speech aimed at a certain group (religious groups, races, political groups, etc.). An example of a tweet with this label is "Of course, when this AniesSandi's success team is tweeting, his is silly and pretentious champ".
5) Religion Hate Speech (HS_Religion): This label is used for hate speech based on religion, religious organizations, or a certain creed. An example of a tweet with this label is "USER USER White turban in white robe is rotten like camel shit bro".
6) Race Hate Speech (HS_Race): This label is used for hate speech based on race or ethnicity. An example of a tweet with this label is "USER USER There are indeed Chinese who had fought, but much more traitors as Dutch slaves..".
7) Physical Hate Speech (HS_Physical): This label is used for hate speech based on physical differences (e.g. face shape, nose, and other body parts) or disability (e.g. autism, idiocy, blindness, etc.). An example of a tweet with this label is "A woman like you just die, just ugly, don't be pretentious like a make-up artist!".

3 https://github.com/okkyibrohim/id-multi-label-hate-speech-and-abusive-language-detection
4 https://github.com/twintproject/twint



8) Gender Hate Speech (HS_Gender): This label is used for hate speech based on gender; it curses someone (or a group) using words that degrade a gender or a certain sexual orientation. An example of a tweet with this label is "Are you transgender or what, you dog :( URL".
9) Other Hate Speech (HS_Other): This label is used for hate speech in the form of ridicule or slander that is not related to religion, race, physical traits, or gender. An example of a tweet with this label is "RT USER: Foolish admin in the idiot party \n Fool Party for the Fools".
10) Weak Hate Speech (HS_Weak): This label is used for hate speech aimed at someone without any incitement or provocation (usually a personal matter). An example of a tweet with this label is "USER who the hell are you, you are not my friend, you dirt, I've been spoiled myself".
11) Moderate Hate Speech (HS_Moderate): This label is used for hate speech aimed at a certain group without any incitement or provocation; the conflict is expected to occur only on social media. An example of a tweet with this label is "USER The tadpole is really good at making and spreading hoaxes".
12) Strong Hate Speech (HS_Strong): This label is used for hate speech aimed at someone or a group with incitement or provocation. This kind of hate speech needs to be prioritized and resolved immediately because it can invite widespread conflict and real-world destruction. An example of a tweet with this label is "The 4 parties that walked out were all big parties. The bull is gone. The key is on the hands of the people. Don't wait for 2019, immediately depose Jokowi. REVOLUTION".

B. Preprocessing
There are five preprocessing steps, as follows.
1) Data cleaning: We remove punctuation, numbers, and special symbols such as usernames, re-tweet (RT) symbols, URLs, and HTML character entity references.
2) Word normalization: We change non-standard words into standard words, e.g. from 'saaaaaad' to 'sad'. Word normalization in this study uses the dictionary from [2], which consists of 15167 words. In addition, we inspected each tweet in the dataset and added every non-standard word we found to the dictionary, a total of 2154 additional words.
3) Translation: We translate English into Indonesian using the Google Translate API, since our system is built to classify hate speech in Indonesian.
4) Stemming: We use PySastrawi to stem Indonesian words and the WordNet Lemmatizer module from the NLTK library to lemmatize English words.



5) Stopword removal: This process uses a stopword list that we created. In addition, we use the Indonesian and English stopword modules from the NLTK library. (A code sketch of the preprocessing and feature extraction steps is given at the end of this subsection.)

C. Feature Extraction
In this research, we use Term Frequency-Inverse Document Frequency (TF-IDF) as the feature extraction method. TF-IDF is popular for text classification because it describes how important a word is in a text [9]. It is a combination of Term Frequency (TF) and Inverse Document Frequency (IDF). TF is the number of occurrences of a word in a text; the more often a word occurs, the more important it is for that text. The document frequency is the number of documents containing a certain word, and IDF is its inverse: the more documents contain the word, the less important the word is. The TF-IDF weight of word $t$ in document $d$ is calculated with the following equation:

$$\mathrm{tfidf}(t, d) = tf_{t,d} \times \log \frac{N}{df_t} \tag{1}$$

where $tf_{t,d}$ is the occurrence frequency of word $t$ in document $d$, $N$ is the number of documents in the dataset, and $df_t$ is the number of documents in the dataset containing word $t$.
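To make the pipeline concrete, the following is a minimal Python sketch of the preprocessing steps and the TF-IDF feature extraction described above. The normalization dictionary excerpt, the cleaning regex, and the example tweet are illustrative assumptions rather than the authors' exact implementation, and the translation step is omitted; PySastrawi, NLTK, and TF-IDF are the components named in this section.

```python
import re

from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical excerpt of the normalization dictionary (the full dictionary has 15167 + 2154 entries).
norm_dict = {"saaaaaad": "sad", "gak": "tidak"}

stemmer = StemmerFactory().create_stemmer()  # PySastrawi stemmer for Indonesian
stop_words = set(stopwords.words("indonesian")) | set(stopwords.words("english"))

def preprocess(tweet, use_stemming=True, use_stopword_removal=True):
    # 1) Data cleaning: usernames, RT markers, URLs, HTML entities, numbers, punctuation
    tweet = re.sub(r"@\w+|\bRT\b|https?://\S+|&\w+;|[^A-Za-z\s]", " ", tweet)
    tokens = tweet.lower().split()
    # 2) Word normalization using the dictionary
    tokens = [norm_dict.get(t, t) for t in tokens]
    # 3) Translation of English words into Indonesian is omitted in this sketch.
    # 4) Stemming (skipped in the best-performing scenario reported later)
    if use_stemming:
        tokens = [stemmer.stem(t) for t in tokens]
    # 5) Stopword removal (also skipped in the best-performing scenario)
    if use_stopword_removal:
        tokens = [t for t in tokens if t not in stop_words]
    return " ".join(tokens)

tweets = ["RT @user contoh tweet yang saaaaaad banget http://t.co/x"]  # toy example
cleaned = [preprocess(t, use_stemming=False, use_stopword_removal=False) for t in tweets]

# TF-IDF features as in Eq. (1)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)
```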



D. Classification
1) Support Vector Machine
SVM is a supervised learning classification method that performs fairly well in identifying hate speech [7]. SVM is used to find the optimal hyperplane that separates positive and negative samples [10]. The training data in SVM are represented as pairs $(x_i, y_i)$, where $x_i \in \mathbb{R}^n$ is the input pattern and $y_i \in \{-1, 1\}$ is the label of the two classes (positive and negative). The main challenge in SVM is to minimize the classification error while maximizing the margin between the positive and negative classes. The SVM optimization problem can be transformed into its dual form and solved using the Lagrange multiplier method. The dual problem is given by the following equation:

$$\max_{\alpha} \; \sum_{i} \alpha_i - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
$$\text{s.t.} \quad \sum_{i} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C \tag{2}$$

where $\alpha_i$ is the Lagrange multiplier, $K(x_i, x_j)$ is the kernel function, and $C$ is the regularization parameter. The dual objective needs to be maximized to find the optimal hyperplane.

a) Regularization parameter: The regularization parameter ($C$) adjusts the penalty for classification errors. Classification performance improves if an appropriate value of $C$ is chosen [11]. Conversely, increasing $C$ too much can cause overfitting.

b) Kernel function: In general, most real-world problems are non-linear. The kernel was therefore introduced to handle non-linear classification. This technique maps the original data into a high-dimensional kernel space in which the data can be linearly separated. The kernel technique is widely used in applications of SVM. Some popular kernel functions are the linear, polynomial, radial basis, and sigmoid kernels [11].



c) Gamma: Gamma is a parameter used by the non-linear kernels. The smaller the gamma value, the more linear the resulting hyperplane. Conversely, the higher the gamma value, the more curved the hyperplane becomes and the more closely it fits the training data [12]. However, increasing gamma too much may lead to overfitting the training data.

2) Data transformation
We use data transformation methods to apply the SVM model to multilabel text classification. The data transformation methods used are Label Power-set (LP) and Classifier Chains (CC). LP is a simple but effective data transformation method [13]. LP converts a multi-label classification into a multi-class classification by treating each distinct combination of labels as a unique class [14]. CC is a chain of binary classifiers that predicts labels while paying attention to the label order [15]: labels that have already been classified are used as additional input for the classification of the next label. The label order in this research is: 1) Hate Speech (HS); 2) Abusive; 3) HS_Individual; 4) HS_Group; 5) HS_Religion; 6) HS_Race; 7) HS_Physical; 8) HS_Gender; 9) HS_Other; 10) HS_Weak; 11) HS_Moderate; and 12) HS_Strong.

3) Convolutional Neural Network
CNN is known as a model that performs well in processing images. It also performs fairly well on text, because it can find patterns in a sentence [16]. The CNN in this study uses six layers: 1) input layer; 2) embedding layer; 3) 1-D convolutional layer; 4) global average pooling layer; 5) fully connected hidden layer; and 6) classification/output layer. We also use a pre-trained model with the CNN method, namely DistilBERT. DistilBERT is a distilled version of the BERT pre-trained model; it is smaller than BERT, runs faster, and retains 97% of BERT's language understanding ability [17]. Since the dataset used in this research contains multilingual tweets, we use the model distilled from the Multilingual BERT checkpoint bert-base-multilingual-cased, namely the Multilingual DistilBERT model distilbert-base-multilingual-cased. The model has 6 layers, a hidden size of 768, 12 attention heads, and 134M parameters. The multilingual DistilBERT model first encodes the dataset, and its outputs are then further processed by the CNN.
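The six-layer CNN described above could be expressed in Keras roughly as follows; the vocabulary size, sequence length, filter count, and hidden width are illustrative assumptions, not the paper's exact settings. For the CNN + DistilBERT variant, the embedding layer would be replaced by token representations from the distilbert-base-multilingual-cased checkpoint.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed vocabulary size
MAX_LEN = 50         # assumed maximum tweet length in tokens
NUM_LABELS = 12      # HS, Abusive, HS_Individual, ..., HS_Strong

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),                        # 1) input layer: token ids
    layers.Embedding(VOCAB_SIZE, 128),                     # 2) embedding layer
    layers.Conv1D(128, kernel_size=5, activation="relu"),  # 3) 1-D convolutional layer
    layers.GlobalAveragePooling1D(),                       # 4) global average pooling layer
    layers.Dense(64, activation="relu"),                   # 5) fully connected hidden layer
    layers.Dense(NUM_LABELS, activation="sigmoid"),        # 6) output: one sigmoid unit per label
])

# Binary cross-entropy per label is the usual loss for multi-label outputs.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train_ids, Y_train, epochs=5, batch_size=32)  # placeholders for tokenized data
```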



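For the SVM-based configurations, the data transformation methods described in point 2 above could be wired up roughly as follows. The Classifier Chains wrapper is scikit-learn's; the Label Power-set wrapper shown here is from scikit-multilearn, which is an assumption since the paper does not name its implementation; X_train, Y_train, and X_test are placeholders (TF-IDF features and a binary 12-column label matrix).

```python
from sklearn.svm import SVC
from sklearn.multioutput import ClassifierChain
from skmultilearn.problem_transform import LabelPowerset  # assumed LP implementation

# SVM with the best hyperparameters reported in Section IV: sigmoid kernel, C=10, gamma=0.1
base_svm = SVC(kernel="sigmoid", C=10, gamma=0.1)

# Classifier Chains: one binary SVM per label, following the label order
# HS, Abusive, HS_Individual, ..., HS_Strong; earlier predictions feed later classifiers.
cc_model = ClassifierChain(base_svm, order=list(range(12)))
cc_model.fit(X_train, Y_train)            # Y_train: (n_samples, 12) binary indicator matrix
Y_pred_cc = cc_model.predict(X_test)

# Label Power-set: each distinct combination of the 12 labels becomes one multi-class label.
lp_model = LabelPowerset(classifier=base_svm)
lp_model.fit(X_train, Y_train)
Y_pred_lp = lp_model.predict(X_test)
```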
E. Evaluation
The final stage, carried out to measure the ability of the built system, is evaluation. The evaluation method used in this study is label-based accuracy, computed with the following equation:

$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{|Z(d_i) \cap Y(d_i)|}{|Z(d_i) \cup Y(d_i)|} \times 100\% \tag{3}$$

where $N$ is the number of documents in the dataset, $Z(d_i)$ is the predicted label set for document $d_i$, and $Y(d_i)$ is the actual label set for document $d_i$.



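For reference, a small sketch of the label-based accuracy in Eq. (3), interpreted as the mean per-document ratio of the intersection to the union of predicted and actual label sets (this interpretation is assumed from the definitions above):

```python
import numpy as np

def label_based_accuracy(Y_true, Y_pred):
    """Mean per-document |Y ∩ Z| / |Y ∪ Z| over binary indicator matrices, in percent."""
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_pred = np.asarray(Y_pred, dtype=bool)
    intersection = np.logical_and(Y_true, Y_pred).sum(axis=1)
    union = np.logical_or(Y_true, Y_pred).sum(axis=1)
    # A document with no true and no predicted labels counts as fully correct.
    per_doc = np.where(union == 0, 1.0, intersection / np.maximum(union, 1))
    return 100.0 * per_doc.mean()

# Toy example: two documents, three labels -> (1/2 + 1/1) / 2 = 75.0
print(label_based_accuracy([[1, 0, 1], [0, 1, 0]], [[1, 0, 0], [0, 1, 0]]))
```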
IV. EXPERIMENTAL RESULT
We conducted several experiments to find the best method for multilabel classification of hate speech tweets. We compared the performance of combinations of models and preprocessing scenarios. Six preprocessing scenarios were tested: 1) preprocessing without translation; 2) preprocessing without stopword removal and translation; 3) preprocessing without stopword removal, stemming, and translation; 4) full preprocessing; 5) preprocessing without stopword removal; and 6) preprocessing without stopword removal and stemming. The label-based accuracy of each scenario is shown in Table I. The best accuracy, 74.88%, is obtained using the SVM model and CC on the dataset without stemming, stopword removal, and translation.

A. SVM Hyperparameter Analysis
We analyze the SVM hyperparameters using the best scenario in stages. First, we run a test scenario to find the best kernel. The kernels tested are the linear, polynomial, sigmoid, and radial basis kernels. The results of the kernel testing scenario are shown in Fig. 1.



[Fig. 1. Kernel test scenario: bar chart of label-based accuracy (%) for the linear, polynomial, radial basis, and sigmoid kernels. The sigmoid kernel achieves the highest accuracy (74.88%), followed closely by the radial basis kernel (74.86%); the remaining values in the chart are 71.3% and 39.95%.]



TABLE I. LABEL-BASED ACCURACY FOR MULTILABEL CLASSIFICATION OF HATE SPEECH

Preprocessing scenario                                             | SVM + CC | SVM + LP | CNN   | CNN + DistilBERT
Preprocessing without translation                                  | 74.19    | 72.90    | 60.33 | 61.28
Preprocessing without stopword removal and translation             | 74.79    | 73.59    | 65.07 | 61.24
Preprocessing without stopword removal, stemming, and translation  | 74.88    | 74.36    | 62.40 | 61.30
Full preprocessing                                                 | 73.08    | 71.76    | 62.83 | -
Preprocessing without stopword removal                             | 74.65    | 73.52    | 63.17 | -
Preprocessing without stopword removal and stemming                | 74.25    | 74.34    | 63.76 | -

(All values are label-based accuracy in %.)



As Fig. 1 illustrates, the best kernel for the multilabel classification of hate speech is the sigmoid kernel. This shows that the data in this study are not linearly separable. The best kernel is used in the next test scenario. Next, we conduct a test scenario to find the best C value. The C values tested were 0.01, 0.1, 10, and 100. As Fig. 2 illustrates, the optimal C value is 10. The label-based accuracy increases as C increases and decreases once C is larger than 10. This shows that increasing C too much may cause overfitting and reduce the classification performance. To validate the cause of the decline in accuracy at C values above 10, we carried out several further tests. First, we make predictions on the training data using a C value of 100 and compare them with the best C value. Second, we make predictions for the HS (Hate Speech) label with the same hyperparameters on the training and test data. The results of the further testing are shown in Table II. Based on the results in Table II, it can be concluded that the SVM model with C equal to 100 is overfitted. Overfitting of the SVM model with C equal to 100 is indicated by a label-based accuracy on the training data that is far above the label-based accuracy on the test data; the predictions for the HS label alone show the same gap. In the SVM model with C equal to 10, the accuracy on the training data, both for the multilabel case and for the HS label only, does not differ significantly from the accuracy on the test data, and its test accuracy is also higher than that of the SVM model with C equal to 100. After finding the best C value, we use it to find the best gamma value. The gamma values tested were 0.0001, 0.001, 0.01, 0.1, 1, 10, and 100.

[Fig. 2. Regularization parameter (C) test scenario: label-based accuracy (%) for C values 0.01, 0.1, 10, and 100.]

[Fig. 3. Gamma test scenario: label-based accuracy (%) for gamma values 0.0001, 0.001, 0.01, 0.1, 1, 10, and 100.]

TABLE II. FURTHER TESTING RESULTS FOR HYPERPARAMETER C

Label      | C   | Train data | Test data
All labels | 100 | 89.62      | 69.03
All labels | 10  | 82.95      | 74.88
HS only    | 100 | 92.55      | 79.38
HS only    | 10  | 87.53      | 82.31

(Values are label-based accuracy in %.)
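The staged sweep over kernel, C, and gamma described above could also be run as a single grid search. A hedged sketch for one binary label (e.g. the HS label) is shown below; the parameter grid is taken from the values tested in this section, and X_train and y_train_hs are placeholders for the TF-IDF features and the binary HS label.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate values taken from the test scenarios in this section.
param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.01, 0.1, 10, 100],
    "gamma": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
}

search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=3)
search.fit(X_train, y_train_hs)  # TF-IDF features and the binary HS label
print(search.best_params_)       # expected to be close to sigmoid, C=10, gamma=0.1
```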



As Fig. 3 illustrates, the best gamma value is 0.1. The label-based accuracy continues to increase as the gamma value increases and decreases once the gamma value exceeds 0.1. We also conducted the same further test scenario as for hyperparameter C to validate the cause of the decrease in accuracy at gamma values above 0.1. The results are shown in Table III. Based on Table III, it can be concluded that the SVM model with gamma equal to 100 is underfitted. Underfitting of the SVM model with gamma equal to 100 is indicated by low accuracy on both the training and test data, for the multilabel case as well as for the HS label only. The SVM model with gamma equal to 0.1 has high accuracy on the training and test data, both for the multilabel case and for the HS label only, which differs significantly from gamma equal to 100.

TABLE III. FURTHER TESTING RESULTS FOR HYPERPARAMETER GAMMA

Label      | Gamma | Train data | Test data
All labels | 100   | 31.86      | 31.73
All labels | 0.1   | 82.95      | 74.88
HS only    | 100   | 55.30      | 54.87
HS only    | 0.1   | 87.53      | 82.31

(Values are label-based accuracy in %.)

B. The Effect of Stemming and Stopword Removal
In general, the test scenarios without stemming and stopword removal give better results than those with stemming and stopword removal. Stemming can eliminate affixes that affect the meaning of a word. For example, the word "plebeian" in Indonesian is "kampungan"; if it is stemmed, it becomes "kampung", which means "village". The two words have completely different meanings in Indonesian, so the classification can go wrong. Stopword removal can remove words that are important for determining the target of hate speech, e.g. "I", "you", and "they". In addition, negation words such as "no" and "don't" are lost, which can change the meaning of the tweet and cause a wrong classification. Therefore, stemming and stopword removal are considered less effective for multilabel classification of hate speech on Twitter.



C. The Effect of Translation
Based on the label-based accuracy, the SVM model combinations without translation perform better than the SVM model combinations with translation. The translation still has deficiencies; for instance, a translated word may not match the context of the tweet. For example, the word "wired" in the tweet "Coffee really got me wired right now. A non coffee drinker" is translated into "kabel", which means "wire". There are also Indonesian words translated into English, even though we use a translator from English to Indonesian, e.g. the phrase "di sini" translated into "here". In addition, there are outright translation errors, such as the word "orang", which means "people", translated into "orangutan".

D. The Effect of Word Normalization
We conducted a test scenario to observe the effect of word normalization on the multilabel classification. We compared the best scenario with word normalization and the best scenario without word normalization. The best scenario without word normalization has a label-based accuracy of 74.84%; in other words, the best scenario with word normalization is better than the best scenario without it. We also observed 200 sample tweets to see how effective word normalization with the previously built normalization dictionary is. The number of normalized words in these tweets is 941, consisting of 908 words that are normalized in accordance with the context of the tweet and 33 words that are normalized but do not fit the context of the tweet. An example of a normalized word that does not fit the tweet context is "sma", which means "high school", in the tweet "sma students bilah hilir district declare anti hoax; peace pilkada 2018"; it is normalized to "sama", which means "the same", whereas the meaning that fits the tweet context is "high school". Based on this analysis, word normalization using the normalization dictionary is quite effective for the classification, although there are some normalization errors.

E. Classification Model
The SVM model has better accuracy than the CNN model. Most scenarios that use the CNN model fail to predict labels with a low occurrence in the dataset. We analyzed the effect of the number of trained tweets per label on the number of predicted tweets per label using the CNN model without stopword removal, stemming, and translation. Examples of labels with low occurrence rates are HS_Race, HS_Physical, HS_Gender, and HS_Strong. The results of the label analysis are shown in Table IV. In the classification using the CNN model, the number of tweets predicted to have these labels is very low or even zero.



TABLE IV. TOTAL TWEETS OF EACH LABEL ON TRAINING DATA AND RESULTS OF MODEL PREDICTION

Label       | Total tweets (train data) | CNN prediction | SVM prediction
HS_Race     | 433                       | 8              | 103
HS_Physical | 251                       | 0              | 35
HS_Gender   | 248                       | 0              | 36
HS_Strong   | 375                       | 0              | 85



The number of tweets predicted by the CNN model to have the HS_Race label is 8, of which only 6 are classified correctly, and no tweets are predicted to have the HS_Physical, HS_Gender, and HS_Strong labels. In contrast, the SVM model is capable of predicting tweets with labels that have a low occurrence rate. The CNN model is also not able to identify dependencies between labels, so many tweets are misclassified. For example, the tweet "Thats not common bullshit, that is pure shit... RT USER: DPR #commonbullshit" is predicted as hate speech with an abusive word. Every hate speech tweet must also be classified by the target, category, and level of the hate speech; however, the system that uses the CNN model does not assign a target or level to this tweet. Furthermore, the results of the data transformation analysis show that the CC method performs better than LP. In LP data transformation, the number of classes, i.e. the power-set of all labels, is high. Therefore, some classes exist in the training data but not in the test data, and vice versa. In this study, there are ten label sequences that exist only in the training data; a total of 17 tweets have those sequences, so these tweets have no effect on the test data. Label sequences with low occurrence in the training data are difficult to predict in the test data. In this case, CC performs better because the classification is carried out per label. Table V shows an example of the classification results for tweets whose label sequence has a low occurrence. The label sequence "1,1,0,1,0,0,0,0,1,0,0,1" appears in 13 tweets in the training data and three tweets in the test data. The SVM model with the CC method correctly predicts one tweet with that label sequence, while the two other tweets are wrongly classified; both have one misclassified label. The SVM model that uses LP is not able to predict this label sequence at all: all three tweets in the test data are wrongly classified, the first and second with one misclassified label and the third with two misclassified labels.



TABLE V. CLASSIFICATION RESULTS OF LABEL SEQUENCES WITH LOW OCCURRENCE

Label sequence          | Freq. in train data | Freq. in test data | Predicted tweets (CC) | Predicted tweets (LP) | Correctly predicted (CC) | Correctly predicted (LP)
1,1,0,1,0,0,0,0,1,0,0,1 | 13                  | 3                  | 1                     | 0                     | 1                        | 0
1,0,1,0,1,0,0,0,0,0,0,1 | 4                   | 3                  | 1                     | 0                     | 0                        | 0



TABLE VI. THE EXAMPLES OF MISCLASSIFIED TWEETS

Cause of misclassification | Tweet
Word meaning       | "Anjing juga:( dogs deserve to live just as well as cats should. https://twitter.com/meidianatahir/status/1223266322999435264 …" (In English: "Dogs too:( dogs deserve to live just as well as cats should. …")
OOV                | "USER Jancuk! Pingin gue tabok itu anak'" (In English: "USER Fuck you! I want to smack that kid")
Word normalization | "RT USER USER kebesaran jiwa anies?? yang ada kebesaran ego dan ambisi. liat cara kampanyenya yang brutal 11 12 ama sandi" (In English: "RT USER USER Anies' big heart?? In fact, excessive ego and ambition. Looked from his brutal campaign, just like sandi")



F. Tweet Classification Error
We analyzed several causes of tweets being misclassified. The classification using the best scenario has 1172 misclassified tweets, consisting of 959 tweets containing out-of-vocabulary (OOV) words and 213 tweets containing no OOV words. We took 100 samples from the misclassified tweets that contain OOV words. Based on the analysis of these samples, 79 tweets were misclassified because the system fails to understand the meaning of the tweet, 20 tweets were misclassified because of an OOV word, and one tweet was misclassified due to a normalization error. The same situation appears for the misclassified tweets that contain no OOV words: all of them are misclassified because the system fails to understand the meaning of the tweet. Table VI shows several examples of tweets and the causes of their misclassification. For the example tweet that was misclassified because of a word-meaning misunderstanding, we suspect that the word "dog" made the system decide that the tweet contained an abusive word; in fact, the word "dog" in this tweet literally refers to the animal, not a curse. The second tweet in the table contains hate speech; however, the word "tabok", which means "smack", is OOV. This is considered to cause the system to classify the tweet as hate speech with an abusive word. The last example in the table is classified as hate speech that does not contain an abusive word; however, the word "ego" in the tweet is normalized to "bego", which means "stupid". This is considered to cause the system to classify the tweet as hate speech with an abusive word.

V. CONCLUSION
In this research, we conducted several experiments to find the best method for multilabel classification of hate speech tweets. Based on the experiments, the classification using the SVM model and CC on the dataset without stemming, stopword removal, and translation is the best scenario, with a label-based accuracy of 74.88%. Moreover, the best SVM hyperparameters for multilabel classification are the sigmoid kernel, a regularization parameter value of 10, and a gamma value of 0.1. Regularization parameter values that are too high can cause overfitting, and gamma values that are too high can cause underfitting. The SVM model with CC gives better results than the SVM model with LP, the CNN model, and the CNN model with DistilBERT. Stemming and stopword removal are less effective in this research. In addition, the translation process using the Google Translate API still produces ambiguities and errors. For future work on multilabel classification of Indonesian hate speech, we suggest three improvements. First, extend the dataset so that the number of tweets per label is more balanced. Second, improve the normalization dictionary to make the classification more effective. Third, use a language model to handle ambiguities and errors during translation.



REFERENCES
[1] National Human Rights Commission of Indonesia, Pocket Book for Handling Hate Speech, 2015.
[2] M. O. Ibrohim and I. Budi, "Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter," 2018.
[3] A. I. Pratiwi and Adiwijaya, "On the Feature Selection and Classification Based on Information Gain for Document Sentiment Analysis," Appl. Comput. Intell. Soft Comput., vol. 2018, 2018.
[4] N. Octaviani and F. Daeli, "Sentiment analysis on movie reviews using Information gain and K-nearest neighbor," J. Data Sci. Its Appl., vol. 3, 2020.
[5] I. Alfina, R. Mulia, M. I. Fanany, and Y. Ekanata, "Hate speech detection in the Indonesian language: A dataset and preliminary study," 2017 Int. Conf. Adv. Comput. Sci. Inf. Syst. (ICACSIS), pp. 233–237, 2018.
[6] M. A. Fauzi and A. Yuniarti, "Ensemble method for Indonesian Twitter hate speech detection," Indones. J. Electr. Eng. Comput. Sci., vol. 11, no. 1, pp. 294–299, 2018.
[7] M. O. Ibrohim and I. Budi, "Translated vs non-translated method for multilingual hate speech identification in Twitter," Int. J. Adv. Sci. Eng. Inf. Technol., vol. 9, no. 4, pp. 1116–1123, 2019.
[8] S. Al Faraby, E. R. R. Jasin, A. Kusumaningrum, and Adiwijaya, "Classification of hadith into positive suggestion, negative suggestion, and information," J. Phys. Conf. Ser., vol. 971, no. 1, 2018.
[9] A. Mishra and S. Vishwakarma, "Analysis of TF-IDF Model and its Variant for Document Retrieval," Proc. 2015 Int. Conf. Comput. Intell. Commun. Networks (CICN), pp. 772–776, 2016.
[10] B. Chen, W. Gu, and J. Hu, "An improved multi-label classification based on label ranking and delicate boundary SVM," Proc. Int. Jt. Conf. Neural Networks, 2010.
[11] I. Aydin, M. Karakose, and E. Akin, "A multi-objective artificial immune algorithm for parameter optimization in support vector machine," Appl. Soft Comput. J., vol. 11, no. 1, pp. 120–129, 2011.
[12] P. Lameski, E. Zdravevski, R. Mingov, and A. Kulakov, "SVM Parameter Tuning with Grid Search and Its Impact on Reduction of Model Over-fitting," Lect. Notes Comput. Sci., vol. 9437, pp. 464–474, 2015.
[13] G. Tsoumakas, I. Katakis, and I. Vlahavas, "Mining Multi-label Data," Data Min. Knowl. Discov. Handb., pp. 667–685, 2009.
[14] K. Dembczynski, W. Waegeman, W. Cheng, and E. Hullermeier, "On Label Dependence in Multi-label Classification," Int. Conf. Mach. Learn., p. 8, 2010.
[15] J. Read and J. Hollmén, "A deep interpretation of classifier chains," Lect. Notes Comput. Sci., vol. 8819, pp. 251–262, 2014.
[16] X. Zhang and Y. LeCun, "Text Understanding from Scratch," 2015.
[17] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter," pp. 2–6, 2019.