This is my MSc thesis, for the MSc (by Research) in Artificial Intelligence. The research examined the use of Rough Set Dimensionality Reduction to simplify the very large datasets used in Information Filtering. The methodology was applied experimentally to the filtering of Unsolicited Commercial Email (spam), with promising results.
Automated Information Filtering (IF) and Information Retrieval (IR) systems are becoming increasingly prominent. Unfortunately, most attempts to produce effective IF/IR systems are either expensive or unsuccessful: systems fall prey to the extremely high dimensionality of the domain. Numerous attempts have been made to reduce this dimensionality. However, many involve oversimplifications and naïve assumptions about the nature of the data, or rely on linguistic aspects such as semantics and word lists, which cannot be relied upon in multi-cultural domains like the Internet. This thesis proposes a dimensionality reduction approach based on Rough Set Theory: it makes few assumptions about the nature of the data it processes and can cope with the subtleties of IF/IR datasets. The technique reduces the dimensionality of IF/IR data by 3.5 orders of magnitude while ensuring that no useful information is lost.
Rough Sets and IF/IR techniques are described and related work is reviewed; a taxonomy of IF/IR techniques is drawn up to aid in comparing existing approaches. An experimental system applied to E-mail message classification is proposed, designed and implemented. Its results are discussed in detail by verifying the project aims; conclusions are drawn regarding the success of the system and future work is proposed.
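To give a feel for what rough-set attribute reduction does, here is a minimal sketch in Python. It is not the code of the RSAR tool, and it is not claimed to be the exact algorithm used in the thesis. It is a generic greedy search that keeps adding attributes until the chosen subset classifies the data as consistently as the full attribute set does. The function names (`partition`, `dependency`, `greedy_reduct`), the four terms and the six toy messages are all assumptions made up for illustration.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes: rows that have the
    same values on every attribute in `attrs` fall into the same block."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, labels, attrs):
    """Rough-set dependency degree: the fraction of rows lying in blocks
    whose members all share one label (the positive region)."""
    consistent = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            consistent += len(block)
    return consistent / len(rows)


def greedy_reduct(rows, labels):
    """Greedily add the attribute that raises the dependency degree most,
    stopping once the subset matches the dependency of all attributes."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    chosen = []
    current = dependency(rows, labels, chosen)
    while current < target:
        candidates = [a for a in all_attrs if a not in chosen]
        best = max(candidates,
                   key=lambda a: dependency(rows, labels, chosen + [a]))
        chosen.append(best)
        current = dependency(rows, labels, chosen)
    return chosen


# Hypothetical toy data: each row marks which terms occur in a message.
TERMS = ["free", "meeting", "offer", "the"]
ROWS = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
LABELS = ["spam", "spam", "ham", "ham", "spam", "ham"]

reduct = greedy_reduct(ROWS, LABELS)
print([TERMS[a] for a in reduct])
```

On this toy data the term "the" occurs in every message, so it cannot help tell spam from ham and is left out of the result. The remaining three terms still separate the two classes exactly as well as all four did. Real IF/IR datasets have many thousands of terms, which is where a reduction of this kind matters.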
Download the Thesis
The thesis is available here, as a PDF document. You can also download the free source code for RSAR, the tool used to reduce dataset dimensionality.