Alexios' Home Page

Rough Set-based Attribute Reduction (RSAR) 

HomeProgramming ► RSAR ►

RSAR treats datasets by removing attributes that are unnecessary for a classification task. It performs greedy feature selection using various versions of the QuickReduct algorithm. It is useful in reducing redundancies in nominally-valued (i.e. discrete) datasets for exploration or as a preprocessing step to training machine learning algorithms on the data.

RSAR is available as source, Debian and RPM packages. It's been successfully compiled on GNU/Linux (x86 and SPARC) and Solaris 7 (SPARC). It's provided under the terms of the GNU General Public License, version 2.

RSAR Manual Page 

RSAR

NAME
SYNOPSIS
DESCRIPTION
EXIT STATUS
OPTIONS
INPUT FORMAT
EXAMPLES
BUGS
RESTRICTIONS
AUTHOR
REPORTING BUGS
COPYRIGHT

NAME

rsar − Rough Set-based Attribute Reduction

SYNOPSIS

rsar [OPTION]... FILE

DESCRIPTION

rsar is a Rough Set-based Attribute Reduction (aka Feature Selection) implementation. This is an implementation of ideas described, among other places, in the following paper:

Qiang Shen and Alexios Chouchoulas, A Modular Approach to Generating Fuzzy Rules with Reduced Attributes for the Monitoring of Complex Systems. Engineering Applications of Artificial Intelligence, 13(3):263-278, 2000.

rsar reads in a MIMO (Multiple Input, Multiple Output) dataset, performs RS-based feature selection on it, and returns the selected feature subset.

Four versions of the QuickReduct algorithm are supported, QuickReduct, QuickReduct III, QuickReduct IV and QuickReduct V (progressively faster implementations). QuickReduct II is a backward elimination version of QuickReduct and is not supported yet; neither is exhaustive search for reducts.

EXIT STATUS

An exit status of zero indicates success (this means the software ran successfully and does not reflect on the quality of rsar’s results). A non-zero exit status denotes abnormal termination due to an error, which should have been reported to standard error.

OPTIONS

The following options specify high-level behaviour.

-n, --naive

Select the naive gamma evaluator. This evaluator follows rough set theory closely, and is provided as the basis for benchmarking the optimised gamma evaluator (which is active in default of this command line option). This option implies --no-gamma-disregard --no-gamma-bail. Specifying otherwise will output a warning.

-3, --III

Enable QuickReduct III. Same as --bail-if-found.

-4, --IV

Enable QuickReduct IV. Same as --III --gamma-disregard.

-5, --V

Enable QuickReduct V. Same as --IV --gamma-bail.

-A, --all-reducts

Disable QuickReduct. Instead, find all possible reducts in the dataset using exhaustive search. This option implies --no-bail-if-found --no-gamma-bail, and in this version, --no-gamma-disregard (a future version will remove this arbitrary limitation). Runtime increases exponentially with the number of features in the dataset. Gamma will be evaluated for every member in the power set of the set of features. That is, for n attributes, two to the n-th power gamma calculations will be attempted. Use with -C to find all reducts up to a given cardinality. With non-trivial datasets, --timeout may also be useful. The --naive gamma evaluator may also be specified.

The following options allow the programme’s behaviour to be controlled in more detail. The high level options above set these en masse for convenience.

--bail-if-found

Stop processing immediately when the first reduct is found.

-g, --gamma=NORMALISED-REAL

Set the target dependency. By default, rsar aims to reach a dependency (gamma value) of 1.0 (or whatever the gamma value is for the entire feature set). In some cases it is useful to stop short of this ideal dependency to conserve time. Please bear in mind that this behaviour violates the definition of a reduct.

If --no-bail-if-found is specified (the default behaviour), then

the best feature subset will be returned among those of equal cardinality. This is best understood with an example.

$ rsar -1v -g 0.2 --bail-if-found iris.dat
Selected features: {1} |--> {5}, gamma = 0.393333

However, if --no-bail-if-found is added (or left out altogether, as it is the default), the behaviour changes as follows:

$ rsar -1v -g 0.2 iris.dat
Selected features: {1} |--> {5}, gamma = 0.213333
Selected features: {3} |--> {5}, gamma = 0.806667
Selected features: {4} |--> {5}, gamma = 0.746667

Note how there is a better choice (attribute 3) than the first one made in the first run (attribute 1).

--gamma-bail

Stop calculating gamma values as soon as the result is established to be worse than the best one found so far.

--gamma-disregard

In adding further features to the subset, do not examine samples that are known to be discernible given the current feature subset. Only examine indiscernible pairs. This reduces the effective size of the dataset as QuickReduct approaches a solution.

--no-bail-if-found

Don’t stop processing immediately, even if a reduct is found. Process the remaining candidate features for this reduct cardinality, then stop. This gives QuickReduct the chance to locate all reducts of cardinality equal to that of the first reduct located. This is the default behaviour.

--no-gamma-bail

Calculate gamma values regardless of the best gamma value so far (this is the default behaviour).

--no-gamma-disregard

Always examine the entire dataset (this is the default behaviour).

-S, --string-tokens

Instruct the naive gamma evaluator to hash dataset values. This allows string values to be read in and processed, at the cost of some speed and memory. The default behaviour is to only read integer values, and to perform no hashing. This option has no effect on the default, optimised gamma evaluator (it can cope with arbitrary symbolic data).

These options allow the behaviour of rsar to be limited by different criteria.

-C, --cardinality=INTEGER

Do not search for reducts of greater cardinality than the specified one. Any reduct sets obtained will contain at most INTEGER elements. There is no guarantee a reduct will be found if this option is specified. If it is necessary to obtain a reduct, the -g option may also need to be specified to relax the definition of ‘reduct’. This option is useful along with -A, as a means of calculating all reducts up to a given cardinality.

--attribute-timeout=SECONDS

Do not spend more than the specified time looking for the next attribute to add. If the timeout expires, add the best attribute found so far. If none has been found, the entire process is terminated. Please note that this option may not be available on your operating system.

--timeout=SECONDS

Stop when the specified time elapses, yielding the best reduct so far, if one is available. Please not that this option may not be available on your operating system.

The following options describe the dataset to be reduced.

-a, --attributes=INTEGER

Specify the number of conditional attributes in the dataset. This is a natural number. Unless specified, it is automatically calculated by reading FILE’s first datum and subtracting one from the number of fields found.

-c, --classes=INTEGER

Specify the number of decision (class) attributes in the dataset. This is a natural number. Unless specified, it is assumed to be one. If both -a and -c are specified, the specified numbers must add up to the number of fields on each dataset line.

-d, --delimiters=DELIMS

Use DELIMS as field delimiters. The default is "<space><tab>,", which allows rsar to use most types of dataset.

The following options control what information is displayed and how it is formatted.

-0, --base0

Fields and line numbers start at zero (default).

-1, --base1

Fields and line numbers start at one.

--dataset-in-ram

Read the dataset into main memory. This is the fastest access method, but may cause heavy memory footprint, especially for extremely large datasets. Datasets are read in their entirety, as immense strings of characters. They are not converted to arrays of integers. This obvious limitation is expected to be rectified in a future version of rsar.

--dataset-mmap

Leave the dataset on disk, access using virtual memory via memory-mapped I/O. If memory-mapped I/O fails, conventional file I/O is used as a fallback. A warning will be issued in such a case. This is the default behaviour for operating systems that offer memory mapping support.

--dataset-on-disk

Leave the dataset on disk, access using file I/O. This is the default behaviour for operating systems lacking memory mapping support. The dataset will be read numerous times. The operating system is expected to offer file caching facilities for this method to work well, otherwise intense disc activity will be observed (and a corresponding speed penalty will be incurred).

-m, --memory

Print memory usage statistics upon completion. Statistics include the integral of allocated memory (whether freed or not). If the operating system supports it, more detailed information is included (usually the amount of main memory occupied by the program at termination).

-o, --output=name

Write output to name instead of standard output.

-r, --ranges

Print sets using ranges, not individual members. For example, the set {1,2,4,5,6,7} would become {1,2,4-7}. This is useful in processing datasets of very high dimensionality, but makes sets less machine-readable. The default is off.

-s, --statistics

Print statistics upon completion. Statistics include the number of passes through the dataset, number of bytes read, number of tokens (attribute values) read, and dataset lines (samples) read.

-t, --time

Print the elapsed running time.

-v, --verbose

Print more information. By default, rsar’s output is meant to be machine-readable for post-processing or gathering statistics. For a more human-readable format, specify -v once. Specified more than three times, this argument causes copious amounts of debugging and status information to be printed (this is almost certainly useless for anything but debugging).

Finally, these are standard options most UNIX software provides.

-?, --help

List all available options and their meanings.

-h, --brief-help, --usage

Display brief help.

-V, --version

Show version of program.

INPUT FORMAT

The input dataset must consist of a collection of instances or samples, each on its own line. Attribute values must be separated by one of the characters specified by the -d or --delimiters command-line arguments. The default field delimiters are space (ASCII 32), tab (ASCII 9) and comma (ASCII 44), and should allow most available datasets to be parsed without changes.

Attribute values must be integers. String and floating point values are not allowed. See RESTRICTIONS below for a discussion of the reasons.

An example dataset would look like:

             1  0  2  2  0
             0  1  1  1  2
             2  0  0  1  1
             1  1  0  2  2
             1  0  2  0  1
             2  2  0  1  1
             2  1  1  1  2
             0  1  1  0  1

EXAMPLES

The most basic invocation of rsar is this (using the bundled iris.dat file):

       $ rsar iris.dat
       reduct 0,1,2 4 1
       reduct 0,2,3 4 1

This reduces the dataset iris.dat. A perfect discernibility was attained (the last number is the value of gamma). Two reducts were found: one involves the dataset’s first (0), second (1) and third (2) attributes; the other the first, third and fourth ones. The label (or class attribute, or decision attribute) is the fifth attribute (4). Attribute and sample numbers are zero-based by default. This can be overridden by specifying -1. To make the process and output less cryptic:

       $ rsar -1svv iris.dat
       gamma ( {1} |--> {5} ) = 0.213333
       gamma ( {2} |--> {5} ) = 0.126667
       gamma ( {3} |--> {5} ) = 0.806667
       gamma ( {4} |--> {5} ) = 0.746667

      gamma ( {1,3} |--> {5} ) = 0.986667
       gamma ( {2,3} |--> {5} ) = 0.966667
       gamma ( {3,4} |--> {5} ) = 0.98

      gamma ( {1,2,3} |--> {5} ) = 1

      A reduct was found.

      Selected features: {1,2,3} |--> {5}, gamma = 1
       gamma ( {1,3,4} |--> {5} ) = 1

      A reduct was found.

      Selected features: {1,3,4} |--> {5}, gamma = 1

      Passes through the dataset:        10
       Total bytes read:               21460
       Total tokens (values) read:      6559
       Total lines (samples) read:      1500

This shows the processing as it is carried out (-vv, specified twice for increased verbosity), displays attributes starting at 1 (-1) and prints statistics at the end of processing (-s).

The first paragraph shows the results of attempting to select the best first attribute (feature). The third attribute is selected, as it provides a gamma of approximately 0.8. The second paragraph shows the addition of the second attribute. The third paragraph adds a third attribute, and reaches two reducts, {1,2,3} and {1,3,4}.

On a POSIX operating system, it is possible to get a wide range of statistics by specifying the -m, -s and -t options. This particular example shows rsar yielding all reducts of the Iris dataset on a 1.2 GHz x86-based machine with 512 Mb of main memory, and running Linux:

       $ rsar -1svAmt iris.dat
       Selected features: {1,2,3} |--> {5}, gamma = 1
       Selected features: {1,2,3,4} |--> {5}, gamma = 1
       Selected features: {1,2,4} |--> {5}, gamma = 1
       Selected features: {1,3,4} |--> {5}, gamma = 1
       Selected features: {2,3,4} |--> {5}, gamma = 1

      Passes through the dataset:        16
       Total bytes read:               34336
       Total tokens (values) read:     11054
       Total lines (samples) read:      2400
       User time elapsed:              0.010 seconds (1 jiffies)
       System time elapsed:            0.000 seconds (0 jiffies)
       Time elapsed:                   0.013 seconds
       Maximum memory size (RSS):     294912 bytes
       Integral memory allocated:     185968 bytes

BUGS

Joe Halliwell <joeh@dai.ed.ac.uk> has shown (by counter-example) that rsar is not guaranteed to yield a minimal reduct. Certain datasets form local optima that cause the algorithm to miss the minimal reduct (though it still arrives at a reduct). It has been argued that, in practice, the yielded reduct is close enough to a minimal reduct for all intents and purposes.

The exhaustive search (--all-reducts) does not honour the --gamma-disregard option, although it could, and should. This is the result of a bug in the current implementation of the exhaustive search algorithm.

The semantics and operation of the --gamma-target option are overly complicated.

See restrictions below.

There is always one more bug.

RESTRICTIONS

Please note that this restriction has been removed in the present version of rsar. The discussion is left here because it may be of service to users interested in additional speed of operation. The default behaviour of the optimised gamma evaluator is to deal with string values in the dataset. However, the default behaviour of the naive gamma evaluator (see --naive) is still to read in integers! To process a dataset with discrete, symbolic values, specify -nS or --naive --string-tokens.

When dealing with integer values, you will need to massage datasets to an integer format before they can be accepted by rsar.

This restriction was placed for a reason: it accelerates reading the dataset from disk, speeds up its parsing into tokens, drastically simplifies the data structures needed to store the tokens (dynamic, 32-bit integer arrays instead of hashes) and thus increases overall processing speed. Converting a dataset to rsars’s required format needs to be done once, not once per dataset pass.

The dataset cannot be a device special, FIFO or pipe. It needs to be a file on a random-access device, as rsar makes numerous passes through the file in the course of reduction. This is yet another conscious design decision to facilitate the processing of extremely large datasets (several tens of thousands of attributes by several hundreds of objects). The operating system should be much more effective (and faster) at caching the dataset in main memory during the course of reduction.

AUTHOR

Written by Alexios Chouchoulas.

REPORTING BUGS

Report bugs to Alexios Chouchoulas <alexios@dai.ed.ac.uk>.

COPYRIGHT

Copyright © 1999-2003 Alexios Chouchoulas <alexios@dai.ed.ac.uk>.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.