UTHM Institutional Repository

Improving the accuracy of text document clustering based on syngram algorithm

Omar, Abdul Halim (2015) Improving the accuracy of text document clustering based on syngram algorithm. Masters thesis, Universiti Tun Hussein Onn Malaysia.




In most of the literature, the Vector Space Model (VSM) represents a text document by the frequencies of the terms that occur in it. In general, VSM ignores the relationships between the terms in a document and treats each term as a single, independent entity. This leads to two major limitations, polysemy and synonymy, both of which are significant in determining the content of a text document. To overcome both limitations, this study proposes the Syngram algorithm, a combination of WordNet and N-grams. WordNet is used as a lookup database to obtain synonym concepts; it addresses the synonymy limitation by transforming the terms of a text document into sequences of synonym sets. In the second approach, N-grams, as used in language modeling, construct sequences of consecutive terms. This study uses N-grams to address the polysemy limitation by converting text features into chunks of terms. The transformation from frequent single terms to frequent concepts was found to improve the accuracy of text document clustering. An experiment was conducted on the reuters50_50 dataset with 10 classes of author names, and the performance was compared with existing algorithms. The results showed that the proposed algorithm (65.6%) outperformed the existing algorithms VSM (55.4%), N-grams (53.2%) and WordNet (59%).
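The idea of replacing single-term features with synonym-based term chunks can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `SYNONYMS` table is a hypothetical toy stand-in for a WordNet lookup, the function names are invented for illustration, and the order of the steps (synonym mapping first, then N-grams) is an assumption.

```python
from collections import Counter

# Hypothetical toy synonym table standing in for a WordNet lookup.
# Each term maps to one representative label for its synonym set.
SYNONYMS = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
    "purchase": "buy",
    "buy": "buy",
}

def to_concepts(tokens):
    """Replace each token with its synonym-set label, if one is known."""
    return [SYNONYMS.get(t, t) for t in tokens]

def ngrams(tokens, n=2):
    """Return the consecutive n-term chunks of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def syngram_features(text, n=2):
    """Count n-grams built over synonym labels (assumed step order)."""
    tokens = text.lower().split()
    return Counter(ngrams(to_concepts(tokens), n))

doc_a = "buy new car"
doc_b = "purchase new auto"

# Plain term frequencies share only the term "new".
tf_a, tf_b = Counter(doc_a.split()), Counter(doc_b.split())
shared_terms = set(tf_a) & set(tf_b)

# Synonym-aware bigram features coincide for the two documents.
feat_a, feat_b = syngram_features(doc_a), syngram_features(doc_b)
```

In this toy case the two documents have almost no single terms in common, yet their synonym-based bigram features are identical, which is the kind of effect the abstract attributes to moving from frequent single terms to frequent concepts.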

Item Type: Thesis (Masters)
Subjects: Q Science > QA Mathematics > QA76 Computer software
Depositing User: Mrs Hasliza Hamdan
Date Deposited: 12 Apr 2016 04:39
Last Modified: 12 Apr 2016 04:39
URI: http://eprints.uthm.edu.my/id/eprint/7878
