UTHM Institutional Repository

Modeling unstructured document using N-gram consecutive and wordnet dictionary

Omar, Abdul Halim and Mohd Salleh, Mohd Najib (2013) Modeling unstructured document using N-gram consecutive and wordnet dictionary. International Journal of Computer Science Issues, 10 (2). pp. 496-504. ISSN 1694-0814



The main issue in Text Document Clustering (TDC) is document similarity. To measure similarity, documents must first be transformed into numerical values. The Vector Space Model (VSM) is one technique capable of converting a document into numerical values. In VSM, a document is represented by the frequencies of the terms it contains, so it works like a Bag of Words (BOW). BOW causes two major problems because it ignores the relationships between terms, treating each term as single and independent. The two problems are Polysemy and Synonymity, both of which concern the relationships between terms. This study combines WordNet and N-grams to overcome both problems. Changing the document features from single terms to Polysemy and Synonymity concepts improved VSM performance. The experiment has four steps: text document selection, preprocessing, clustering, and cluster evaluation using F-measure. On the reuters50_50 dataset, obtained from the UCI repository, the experiment was successful and the results are promising.
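The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, and the set-based F-measure for a single cluster are assumptions made for illustration, and the WordNet synonym lookup is omitted because it needs external data. The sketch builds term-frequency vectors from consecutive N-grams, compares two documents with cosine similarity, and scores one cluster against one reference class.

```python
import math
from collections import Counter


def consecutive_ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(text, max_n=2):
    """Term-frequency vector over 1-grams up to max_n-grams (assumed tokenizer: lowercase + split)."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(consecutive_ngrams(tokens, n))
    return Counter(feats)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def f_measure(cluster, reference_class):
    """Harmonic mean of precision and recall for one cluster against one class (both sets of document ids)."""
    overlap = len(cluster & reference_class)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(reference_class)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    d1, d2 = "bank interest rate", "rate interest bank"
    # Single terms only: word order is ignored, so the documents look identical.
    print(cosine_similarity(features(d1, max_n=1), features(d2, max_n=1)))
    # Adding consecutive 2-grams makes the differing word order visible.
    print(cosine_similarity(features(d1, max_n=2), features(d2, max_n=2)))
```

With single terms only, the two example documents are indistinguishable because BOW discards word order. Adding consecutive 2-grams lowers the similarity, which is the kind of term relationship that plain BOW cannot express.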

Item Type: Article
Uncontrolled Keywords: TDC; TD; VSM; Polysemy; Synonymity; WordNet; Ngram; K-Means Synset Syngram; Cosine Similarity and FMeasure
Subjects: Q Science > QA Mathematics > QA76 Computer software
Divisions: Faculty of Computer Science and Information Technology > Department of Software Engineering
Depositing User: Normajihan Abd. Rahman
Date Deposited: 10 Jan 2017 08:59
Last Modified: 10 Jan 2017 08:59
URI: http://eprints.uthm.edu.my/id/eprint/8211