An improved imputation method based on Fuzzy C-Means and particle swarm optimization for treating missing data

Samat, Nurul Ashikin (2017) An improved imputation method based on Fuzzy C-Means and particle swarm optimization for treating missing data. Masters thesis, Universiti Tun Hussein Onn Malaysia.


Abstract

Data mining techniques are used in various industries, including database marketing, web analysis, information retrieval and bioinformatics, to improve knowledge extraction. When these techniques are applied to real datasets, however, missing values are a common problem. Because missing values can mislead the data mining process and make the extracted knowledge unreliable, they need to be handled, and researchers have therefore proposed imputation methods for the preprocessing stage. Although many imputation methods, such as Mean, k-Nearest Neighbor (k-NN) and Fuzzy C-Means (FCM), have been implemented by other researchers, the accuracy of the replacement values still leaves room for improvement. In this study, an imputation method based on FCM and Particle Swarm Optimization (PSO), called FCMPSO, was developed to obtain better imputation values. FCM clusters the data into two or more subsets with different membership values, which gives better coverage for finding correlations within the dataset. PSO is a swarm optimization algorithm that finds the optimum imputation values effectively and has few parameters to adjust. FCMPSO was applied to the Cleveland Heart Disease dataset with seven artificial missing ratios from 1% to 30%, and to the real missing values in the Framingham Heart Disease dataset, to produce complete datasets. A Decision Tree classifier was then trained on each complete dataset to observe performance in terms of accuracy. At the 30% missing ratio, FCMPSO gives a better RMSE of 0.0237, compared to 0.0250, 0.0402 and 0.0249 for Mean, k-NN and FCM respectively. The analysis of classification accuracy shows an improvement with the proposed imputation, at 81.67% for Cleveland Heart Disease and 86.3% for Framingham Heart Disease, compared to the other imputation methods. Based on these results, the imputed values are slightly more accurate than those of the other imputation methods, which in turn increases the accuracy of Decision Tree classification.
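The evaluation protocol described in the abstract (mask known values at a chosen ratio, impute them, then score the imputed values by RMSE) can be illustrated with a minimal sketch. This is not the thesis implementation: it uses only the Mean baseline, random toy data assumed to be scaled to [0, 1], and three illustrative missing ratios. The function names `mask_values`, `mean_impute` and `rmse` are hypothetical.

```python
import math
import random


def mask_values(rows, ratio, seed=0):
    """Copy `rows` and set about `ratio` of all cells to None.

    Returns the incomplete copy and the list of masked (row, col) positions.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    masked = rng.sample(cells, round(ratio * len(cells)))
    incomplete = [list(r) for r in rows]
    for i, j in masked:
        incomplete[i][j] = None
    return incomplete, masked


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 if a column has no observed value at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def rmse(truth, imputed, positions):
    """Root mean squared error over the masked positions only."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in positions]
    return math.sqrt(sum(squared) / len(squared))


# Toy stand-in for a normalized dataset (assumption: values in [0, 1]).
_rng = random.Random(42)
data = [[_rng.random() for _ in range(5)] for _ in range(100)]

# Illustrative ratios only; the abstract does not list all seven.
for ratio in (0.01, 0.10, 0.30):
    incomplete, masked = mask_values(data, ratio)
    completed = mean_impute(incomplete)
    print(f"{ratio:.0%} missing: RMSE = {rmse(data, completed, masked):.4f}")
```

Any other imputation method can be compared under the same protocol by swapping it in for `mean_impute`, since the RMSE is computed only on the cells that were artificially removed.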

Item Type: Thesis (Masters)
Subjects: Q Science > QA Mathematics > QA76 Computer software
ID Code: 9875
Deposited By: Mr. Mohammad Shaifulrip Ithnin
Deposited On: 27 Mar 2018 12:19
Last Modified: 27 Mar 2018 12:19
