An improved framework for content and link-based web spam detection: a combined approach

Shahzad, Asim (2021) An improved framework for content and link-based web spam detection: a combined approach. Doctoral thesis, Universiti Tun Hussein Onn Malaysia.

Abstract

In the modern digital era, users search the Web for information through search engines (SEs). However, web spammers exploit the Web for financial gain, using web spamming techniques to rank irrelevant and spam pages above relevant pages in search engine results pages (SERPs). These top-ranked but unrelated pages give the user insufficient or inappropriate information, and web spamming therefore sharply degrades search engine quality. Researchers have introduced several web spam detection techniques, such as content-based features, link-based features, label propagation, label refinement, click-based web spam detection, and real-time web spam detection. However, identifying all spam pages on the Web with high accuracy remains unsolved. This work proposes a content-based web spam detection framework, a link-based web spam detection framework, and a combined approach to identify both types of web spam with high accuracy and to detect the newly evolved link pyramid. The content-based framework uses three proposed and two improved content-based algorithms for web spam detection. The link-based framework first exposes the relationship network behind link spamming and then uses the paid-links database algorithm, the spam signals algorithm, and the improved link farms algorithm to identify link-based web spam. Finally, combining the content-based and link-based frameworks enhances the accuracy of web spam detection. The performance of the proposed combined approach was evaluated and compared with the J48 classifier, the C4.5 decision tree classifier, the SVM classifier, and a heuristic combined approach. Experiments were conducted to obtain the threshold values using the proposed collection architecture on the well-known datasets WEBSPAM-UK2006 and WEBSPAM-UK2007. The results show that the proposed methods outperform the other methods, achieving 82.1% precision and an F-measure of 80.6%, which illustrates the proposed framework's effectiveness and applicability.
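
As a reading aid for the reported figures, the following minimal Python sketch shows how precision, recall, and the F-measure relate. It assumes the abstract's "F-measure" is the balanced F1 score and that both reported figures come from the same experiment; the abstract states neither. The function names `f_measure` and `implied_recall` are hypothetical and are not taken from the thesis.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1.

    Assumption: the reported F-measure is the balanced F1 score.
    """
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for these precision and F1 values")
    return f1 * precision / denominator


# Figures reported in the abstract: 82.1% precision, 80.6% F-measure.
recall = implied_recall(0.821, 0.806)
print(round(recall, 3))  # 0.792
```

Under these assumptions, the reported figures would correspond to a recall of roughly 79%. This value is derived here and is not stated in the abstract.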

Item Type:	Thesis (Doctoral)
Subjects:	Q Science > QA Mathematics > QA71-90 Instruments and machines > QA76.75-76.765 Computer software
Divisions:	Faculty of Computer Science and Information Technology > Department of Web Technology
Depositing User:	Mrs. Nur Nadia Md. Jurimi
Date Deposited:	11 Oct 2021 07:58
Last Modified:	11 Oct 2021 07:58
URI:	http://eprints.uthm.edu.my/id/eprint/1777
