LANGUAGE DETECTION FOR INDONESIAN-LANGUAGE TEXT DOCUMENTS

Amir Hamzah

Abstract


In a multi-language corpus such as the Internet, information retrieval systems face a difficulty: a single query can return documents in a mixture of languages, which does not match the user's need. One approach to this problem is to design a cross-language search engine. That solution is unnecessary, however, for users who want answers in only one language, such as Bahasa Indonesia. For these users, the solution is a search engine dedicated to a specific language. In building a language-specific search engine for a multi-language environment, a critical step is detecting the language of each document being analyzed. This research compared several N-gram-based methods of language detection, namely unigram, bigram, and trigram. The test collections comprised news-text collections in Bahasa Indonesia ranging from 100 to 3,000 documents, two academic document collections of 88 and 450 documents, and two English collections, one of abstracts and one of full papers, of 40 documents each. The results showed that unigrams, bigrams, and trigrams were all good parameters for detecting the language of a document. Of these methods, the bigram method was the best in both time complexity and accuracy.
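As an illustration of the general idea of N-gram-based language detection, the following is a minimal sketch and not the method evaluated in this research. The abstract does not state whether the N-grams are taken over characters or words, nor how documents are scored, so the sketch assumes character N-grams and cosine similarity between relative-frequency profiles. The function names (`ngram_profile`, `cosine`, `detect_language`, `build_references`) and the two tiny reference texts in `SAMPLES` are hypothetical and serve only to make the example runnable.

```python
from collections import Counter
import math

# Hypothetical, very small reference texts; a real system would build
# its profiles from a large training collection for each language.
SAMPLES = {
    "id": ("penelitian ini bertujuan untuk membandingkan beberapa metode "
           "deteksi bahasa pada dokumen teks yang digunakan dalam sistem "
           "temu kembali informasi"),
    "en": ("this research aims to compare several methods of language "
           "detection for text documents that are used in an information "
           "retrieval system"),
}


def ngram_profile(text, n):
    """Return relative frequencies of character n-grams (assumed unit)."""
    cleaned = " ".join(text.lower().split())  # lowercase, collapse whitespace
    grams = [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def build_references(n):
    """Build one n-gram profile per language from the sample texts."""
    return {lang: ngram_profile(text, n) for lang, text in SAMPLES.items()}


def detect_language(text, references, n=2):
    """Pick the language whose reference profile is most similar."""
    profile = ngram_profile(text, n)
    return max(references, key=lambda lang: cosine(profile, references[lang]))


if __name__ == "__main__":
    for n in (1, 2, 3):  # unigram, bigram, trigram
        refs = build_references(n)
        guess = detect_language("dokumen ini ditulis dalam bahasa indonesia", refs, n)
        print(n, guess)
```

Changing `n` switches between unigram, bigram, and trigram profiles, which is the comparison the abstract describes. Larger `n` yields more distinct N-grams per profile, so the cost of building and comparing profiles grows with `n`.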


