Sentiment Analysis On YouTube Comments Using Word2Vec and Random Forest

Siti Khomsah

doi:10.31315/telematika.v18i1.4493

Authors

Siti Khomsah Institut Teknologi Telkom Purwokerto http://orcid.org/0000-0002-9967-4341

Keywords:

youtube comments, sentiment analysis, word2vec, skip-gram, random forest

Abstract

Purpose: This study aims to determine the accuracy of sentiment classification using Random Forest, with Word2Vec Skip-gram used for feature extraction. Word2Vec is an effective method for representing aspects of word meaning, and it helps improve sentiment classification accuracy.

Methodology: The research data consist of 31947 comments downloaded from the YouTube channel for the 2019 presidential election debate. The dataset contains 23612 positive comments and 8335 negative comments. To avoid bias, we balance the number of positive and negative comments using oversampling. We use Skip-gram to extract word features. For each input word, Skip-gram produces several features learned from the words in its surrounding context, and each of these features carries a weight. The feature weights of each comment are calculated with an average-based approach. Random Forest is used to build the sentiment classification model. Experiments were carried out several times with different epoch and window parameters. The performance of each model was measured by cross-validation.
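The two preprocessing steps described above, average-based comment features and oversampling, can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: the oversampling is taken to be simple random duplication of minority-class samples, it is applied after feature extraction, and comments with no known word get an all-zero vector. The embeddings, comments, and function names are made up for illustration and are not from the study.

```python
import random
from collections import Counter


def average_vector(tokens, embeddings, dim):
    """Average the embedding vectors of the tokens that have one."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # Assumption: a comment with no known word maps to the zero vector.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until every class is equally large.

    Assumption: the abstract only says "oversampling"; random duplication
    is one common variant and may differ from what the study used.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels


# Toy 2-dimensional word vectors (hypothetical values).
toy_embeddings = {"good": [1.0, 0.0], "debate": [0.0, 1.0], "bad": [-1.0, 0.0]}
comments = [["good", "debate"], ["good"], ["good", "good"], ["bad", "debate"]]
labels = ["positive", "positive", "positive", "negative"]

# One fixed-length feature vector per comment, then class balancing.
features = [average_vector(c, toy_embeddings, dim=2) for c in comments]
balanced_features, balanced_labels = random_oversample(features, labels)
```

In this toy setup the single negative comment is duplicated until both classes have three samples. The balanced feature vectors would then be passed to the classifier.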

Result: Experiments with 1, 5, and 20 epochs and window sizes of 3, 5, and 10 yield an average model accuracy of 90.1% to 91%. In testing, however, accuracy ranges between 88.77% and 89.05%. The testing accuracy is slightly lower than the model accuracy, but the difference is not significant. For future experiments, we recommend using more than twenty epochs and a window size larger than ten, so that accuracy may increase significantly.
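The experimental setup can be sketched as a list of parameter combinations, each scored over cross-validation folds. This is only an illustration under assumptions: the abstract does not say whether every epoch value was paired with every window size, nor how many folds were used, so the full grid, the fold count, and the helper name `k_fold_indices` are hypothetical.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Parameter values reported in the abstract.
epoch_values = [1, 5, 20]
window_values = [3, 5, 10]

# Assumption: all epoch/window pairs are tried (a full grid).
configurations = list(product(epoch_values, window_values))

# Assumption: 5 folds over 10 samples, chosen only to keep the example small.
folds = list(k_fold_indices(n_samples=10, k=5))
```

For each configuration, a model would be trained on the training indices of every fold and scored on the held-out indices, and the fold scores averaged. In practice the samples would usually be shuffled before splitting.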

Value: The number of epochs and the window size in Skip-gram affect accuracy. Increasing the number of epochs and the window size increases accuracy.

Author Biography

Siti Khomsah, Institut Teknologi Telkom Purwokerto

Program Studi Sains Data
