
RSAR reduces datasets by removing attributes that are unnecessary for a classification task. It performs greedy feature selection using various versions of the QuickReduct algorithm. It is useful for removing redundancy from nominally-valued (i.e. discrete) datasets, either for exploration or as a preprocessing step before training machine learning algorithms on the data. This implementation includes a wide range of optimisations that enable it to process extremely large datasets.

RSAR is available as source, Debian and RPM packages. It's been successfully compiled on GNU/Linux (x86 and SPARC) and Solaris 7 (SPARC). It does have a preference for GNU-based environments, but will happily build on others.

What Does It Do?

In layman's terms, RSAR simplifies datasets. A dataset categorises things into classes based on some of their properties (called ‘attributes’). For example, the well-known mushroom dataset contains descriptions (size, shape, colour, et cetera) of various species of mushroom, and whether each is edible or not. An AI program trains on part of the dataset and then attempts to guess the category of other, previously unseen objects; its success rate is a measure of how well it has learned. The vast majority of such systems suffer greatly when dealing with more data than absolutely necessary.

RSAR reads in a dataset and removes from it all attributes that are irrelevant to categorising the items. Other AI systems can then train on this data without suffering to the same extent. Since most datasets (both research and real-world ones) contain immense amounts of redundancy, using RSAR as a pre-processor can be incredibly beneficial: speed-ups of hundreds of times have been measured.

In many cases, RSAR can benefit humans too: the simplified dataset helps gauge which attributes of a problem matter most. It can also cut costs. Building an AI-based monitoring or diagnostic system is much cheaper if irrelevant quantities aren't measured at all, rather than measured by expensive sensors and then eventually discarded by the AI diagnostic system.

As such, RSAR has been designed to produce output for both human and machine, and offers options for doing so in various ways. Please refer to the bundled man page for more information.


Here's an example dataset, saved as test.txt:

1  0  2  2  0
0  1  1  1  2
2  0  0  1  1
1  1  0  2  2
1  0  2  0  1
2  2  0  1  1
2  1  1  1  2
0  1  1  0  1

In the simplest case, executing rsar works like this:

$ rsar --naive test.txt
reduct 1,3 4 1
reduct 2,3 4 1

This indicates there are two reduced versions of the dataset: one using the second and fourth columns, the other the third and fourth. Columns are numbered from 0 (unless --base1 is added to the command line). Both reducts classify into the fifth column of the dataset. The last number is the reduct's γ (0 ≤ γ ≤ 1), the fraction of rows in the dataset that can be correctly classified given the information in the reduct. Bear in mind that many datasets contain ambiguity, so even with no attributes removed at all the maximum γ can be less than 1.
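
The γ above is the rough-set dependency degree: a row counts as correctly classifiable if every row sharing its values on the chosen attributes also shares its class. As an illustration only (RSAR itself is an optimised C program; this is not its code), here is a minimal Python sketch of that computation on the example dataset, using 0-based column indices:

```python
from collections import defaultdict

def gamma(rows, attrs, decision):
    """Dependency degree: the fraction of rows whose equivalence class
    (rows sharing the same values on `attrs`) has a single decision value."""
    classes = defaultdict(set)   # attribute-value tuple -> set of decisions
    counts = defaultdict(int)    # attribute-value tuple -> number of rows
    for r in rows:
        key = tuple(r[a] for a in attrs)
        classes[key].add(r[decision])
        counts[key] += 1
    consistent = sum(n for key, n in counts.items() if len(classes[key]) == 1)
    return consistent / len(rows)

# The test.txt example; columns are 0-based here, so what the --base1
# output calls {2,4} |--> {5} is gamma(DATA, [1, 3], 4) below.
DATA = [
    (1, 0, 2, 2, 0), (0, 1, 1, 1, 2), (2, 0, 0, 1, 1), (1, 1, 0, 2, 2),
    (1, 0, 2, 0, 1), (2, 2, 0, 1, 1), (2, 1, 1, 1, 2), (0, 1, 1, 0, 1),
]
```

For instance, gamma(DATA, [1], 4) gives 0.125 and gamma(DATA, [1, 3], 4) gives 1, matching the figures reported for {2} and {2,4} in the --base1 run below.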

To make things less cryptic, run rsar as follows:

$ rsar --base1 --statistics -vv test.txt
gamma ( {1} |--> {5} ) = 0
gamma ( {2} |--> {5} ) = 0.125
gamma ( {3} |--> {5} ) = 0
gamma ( {4} |--> {5} ) = 0.25

gamma ( {1,4} |--> {5} ) = 0.375
gamma ( {2,4} |--> {5} ) = 1

A reduct was found.

Selected features: {2,4} |--> {5}, gamma = 1
gamma ( {3,4} |--> {5} ) = 1

A reduct was found.

Selected features: {3,4} |--> {5}, gamma = 1

Passes through the dataset:         8
Total bytes read:                 904
Total tokens (values) read:       280
Total lines (samples) read:        64


This version of RSAR was developed on my own time for my university research work and is released under the terms of the GNU General Public License, version 2.