

RSAR treats datasets by removing attributes that are unnecessary for a classification task. It performs greedy feature selection using various versions of the QuickReduct algorithm. It is useful in reducing redundancies in nominally-valued (i.e. discrete) datasets for exploration or as a preprocessing step to training machine learning algorithms on the data.
RSAR is available as source, Debian and RPM packages. It's been successfully compiled on GNU/Linux (x86 and SPARC) and Solaris 7 (SPARC). It's provided under the terms of the GNU General Public License, version 2.
|
rsar − Rough Set-based Attribute Reduction |
|
rsar [OPTION]... FILE |
|
rsar is a Rough Set-based Attribute Reduction (aka Feature Selection) implementation. This is an implementation of ideas described, among other places, in the following paper: Qiang Shen and Alexios Chouchoulas, A Modular Approach to Generating Fuzzy Rules with Reduced Attributes for the Monitoring of Complex Systems. Engineering Applications of Artificial Intelligence, 13(3):263-278, 2000. rsar reads in a MIMO (Multiple Input, Multiple Output) dataset, performs RS-based feature selection on it, and returns the selected feature subset. Four versions of the QuickReduct algorithm are supported, QuickReduct, QuickReduct III, QuickReduct IV and QuickReduct V (progressively faster implementations). QuickReduct II is a backward elimination version of QuickReduct and is not supported yet; neither is exhaustive search for reducts. |
|
An exit status of zero indicates success (this means the software ran successfully and does not reflect on the quality of rsar’s results). A non-zero exit status denotes abnormal termination due to an error, which should have been reported to standard error. |
|
The following options specify high-level behaviour. |
|
-n, --naive |
|
Select the naive gamma evaluator. This evaluator follows rough set theory closely, and is provided as the basis for benchmarking the optimised gamma evaluator (which is active in default of this command line option). This option implies --no-gamma-disregard --no-gamma-bail. Specifying otherwise will output a warning. |
|
-3, --III |
|
Enable QuickReduct III. Same as --bail-if-found. |
|
-4, --IV |
|
Enable QuickReduct IV. Same as --III --gamma-disregard. |
|
-5, --V |
|
Enable QuickReduct V. Same as --IV --gamma-bail. |
|
-A, --all-reducts |
|
Disable QuickReduct. Instead, find all possible reducts in the dataset using exhaustive search. This option implies --no-bail-if-found --no-gamma-bail, and in this version, --no-gamma-disregard (a future version will remove this arbitrary limitation). Runtime increases exponentially with the number of features in the dataset. Gamma will be evaluated for every member in the power set of the set of features. That is, for n attributes, two to the n-th power gamma calculations will be attempted. Use with -C to find all reducts up to a given cardinality. With non-trivial datasets, --timeout may also be useful. The --naive gamma evaluator may also be specified. |
|
The following options allow the programme’s behaviour to be controlled in more detail. The high level options above set these en masse for convenience. |
|
--bail-if-found |
|
Stop processing immediately when the first reduct is found. |
|
-g, --gamma=NORMALISED-REAL |
|
Set the target dependency. By default, rsar aims to reach a dependency (gamma value) of 1.0 (or whatever the gamma value is for the entire feature set). In some cases it is useful to stop short of this ideal dependency to conserve time. Please bear in mind that this behaviour violates the definition of a reduct. |
|
If --no-bail-if-found is specified (the default behaviour), then |
|
the best feature subset will be returned among those of equal cardinality. This is best understood with an example. $ rsar -1v -g 0.2 --bail-if-found iris.dat
Selected features: {1} |--> {5}, gamma = 0.393333
However, if --no-bail-if-found is added (or left out altogether, as it is the default), the behaviour changes as follows: $ rsar -1v -g 0.2 iris.dat
Selected features: {1} |--> {5}, gamma = 0.213333
Selected features: {3} |--> {5}, gamma = 0.806667
Selected features: {4} |--> {5}, gamma = 0.746667
Note how there is a better choice (attribute 3) than the first one made in the first run (attribute 1). |
|
--gamma-bail |
|
Stop calculating gamma values as soon as the result is established to be worse than the best one found so far. |
|
--gamma-disregard |
|
In adding further features to the subset, do not examine samples that are known to be discernible given the current feature subset. Only examine indiscernible pairs. This reduces the effective size of the dataset as QuickReduct approaches a solution. |
|
--no-bail-if-found |
|
Don’t stop processing immediately, even if a reduct is found. Process the remaining candidate features for this reduct cardinality, then stop. This gives QuickReduct the chance to locate all reducts of cardinality equal to that of the first reduct located. This is the default behaviour. |
|
--no-gamma-bail |
|
Calculate gamma values regardless of the best gamma value so far (this is the default behaviour). |
|
--no-gamma-disregard |
|
Always examine the entire dataset (this is the default behaviour). |
|
-S, --string-tokens |
|
Instruct the naive gamma evaluator to hash dataset values. This allows string values to be read in and processed, at the cost of some speed and memory. The default behaviour is to only read integer values, and to perform no hashing. This option has no effect on the default, optimised gamma evaluator (it can cope with arbitrary symbolic data). |
|
These options allow the behaviour of rsar to be limited by different criteria. |
|
-C, --cardinality=INTEGER |
|
Do not search for reducts of greater cardinality than the specified one. Any reduct sets obtained will contain at most INTEGER elements. There is no guarantee a reduct will be found if this option is specified. If it is necessary to obtain a reduct, the -g option may also need to be specified to relax the definition of ‘reduct’. This option is useful along with -A, as a means of calculating all reducts up to a given cardinality. |
|
--attribute-timeout=SECONDS |
|
Do not spend more than the specified time looking for the next attribute to add. If the timeout expires, add the best attribute found so far. If none has been found, the entire process is terminated. Please note that this option may not be available on your operating system. |
|
--timeout=SECONDS |
|
Stop when the specified time elapses, yielding the best reduct so far, if one is available. Please not that this option may not be available on your operating system. |
|
The following options describe the dataset to be reduced. |
|
-a, --attributes=INTEGER |
|
Specify the number of conditional attributes in the dataset. This is a natural number. Unless specified, it is automatically calculated by reading FILE’s first datum and subtracting one from the number of fields found. |
|
-c, --classes=INTEGER |
|
Specify the number of decision (class) attributes in the dataset. This is a natural number. Unless specified, it is assumed to be one. If both -a and -c are specified, the specified numbers must add up to the number of fields on each dataset line. |
|
-d, --delimiters=DELIMS |
|
Use DELIMS as field delimiters. The default is "<space><tab>,", which allows rsar to use most types of dataset. |
|
The following options control what information is displayed and how it is formatted. |
|
-0, --base0 |
|
Fields and line numbers start at zero (default). |
|
-1, --base1 |
|
Fields and line numbers start at one. |
|
--dataset-in-ram |
|
Read the dataset into main memory. This is the fastest access method, but may cause heavy memory footprint, especially for extremely large datasets. Datasets are read in their entirety, as immense strings of characters. They are not converted to arrays of integers. This obvious limitation is expected to be rectified in a future version of rsar. |
|
--dataset-mmap |
|
Leave the dataset on disk, access using virtual memory via memory-mapped I/O. If memory-mapped I/O fails, conventional file I/O is used as a fallback. A warning will be issued in such a case. This is the default behaviour for operating systems that offer memory mapping support. |
|
--dataset-on-disk |
|
Leave the dataset on disk, access using file I/O. This is the default behaviour for operating systems lacking memory mapping support. The dataset will be read numerous times. The operating system is expected to offer file caching facilities for this method to work well, otherwise intense disc activity will be observed (and a corresponding speed penalty will be incurred). |
|
-m, --memory |
|
Print memory usage statistics upon completion. Statistics include the integral of allocated memory (whether freed or not). If the operating system supports it, more detailed information is included (usually the amount of main memory occupied by the program at termination). |
|
-o, --output=name |
|
Write output to name instead of standard output. |
|
-r, --ranges |
|
Print sets using ranges, not individual members. For example, the set {1,2,4,5,6,7} would become {1,2,4-7}. This is useful in processing datasets of very high dimensionality, but makes sets less machine-readable. The default is off. |
|
-s, --statistics |
|
Print statistics upon completion. Statistics include the number of passes through the dataset, number of bytes read, number of tokens (attribute values) read, and dataset lines (samples) read. |
|
-t, --time |
|
Print the elapsed running time. |
|
-v, --verbose |
|
Print more information. By default, rsar’s output is meant to be machine-readable for post-processing or gathering statistics. For a more human-readable format, specify -v once. Specified more than three times, this argument causes copious amounts of debugging and status information to be printed (this is almost certainly useless for anything but debugging). |
|
Finally, these are standard options most UNIX software provides. |
|
-?, --help |
|
List all available options and their meanings. |
|
-h, --brief-help, --usage |
|
Display brief help. |
|
-V, --version |
|
Show version of program. |
|
The input dataset must consist of a collection of instances or samples, each on its own line. Attribute values must be separated by one of the characters specified by the -d or --delimiters command-line arguments. The default field delimiters are space (ASCII 32), tab (ASCII 9) and comma (ASCII 44), and should allow most available datasets to be parsed without changes. Attribute values must be integers. String and floating point values are not allowed. See RESTRICTIONS below for a discussion of the reasons. An example dataset would look like: 1 0 2 2 0
0 1 1 1 2
2 0 0 1 1
1 1 0 2 2
1 0 2 0 1
2 2 0 1 1
2 1 1 1 2
0 1 1 0 1
|
|
The most basic invocation of rsar is this (using the bundled iris.dat file): $ rsar iris.dat
reduct 0,1,2 4 1
reduct 0,2,3 4 1
This reduces the dataset iris.dat. A perfect discernibility was attained (the last number is the value of gamma). Two reducts were found: one involves the dataset’s first (0), second (1) and third (2) attributes; the other the first, third and fourth ones. The label (or class attribute, or decision attribute) is the fifth attribute (4). Attribute and sample numbers are zero-based by default. This can be overridden by specifying -1. To make the process and output less cryptic: $ rsar -1svv iris.dat
gamma ( {1} |--> {5} ) = 0.213333
gamma ( {2} |--> {5} ) = 0.126667
gamma ( {3} |--> {5} ) = 0.806667
gamma ( {4} |--> {5} ) = 0.746667
gamma ( {1,3} |--> {5} ) = 0.986667
gamma ( {2,3} |--> {5} ) = 0.966667
gamma ( {3,4} |--> {5} ) = 0.98
gamma ( {1,2,3} |--> {5} ) = 1
A reduct was found.
Selected features: {1,2,3} |--> {5}, gamma = 1
gamma ( {1,3,4} |--> {5} ) = 1
A reduct was found.
Selected features: {1,3,4} |--> {5}, gamma = 1
Passes through the dataset: 10
Total bytes read: 21460
Total tokens (values) read: 6559
Total lines (samples) read: 1500
This shows the processing as it is carried out (-vv, specified twice for increased verbosity), displays attributes starting at 1 (-1) and prints statistics at the end of processing (-s). The first paragraph shows the results of attempting to select the best first attribute (feature). The third attribute is selected, as it provides a gamma of approximately 0.8. The second paragraph shows the addition of the second attribute. The third paragraph adds a third attribute, and reaches two reducts, {1,2,3} and {1,3,4}. On a POSIX operating system, it is possible to get a wide range of statistics by specifying the -m, -s and -t options. This particular example shows rsar yielding all reducts of the Iris dataset on a 1.2 GHz x86-based machine with 512 Mb of main memory, and running Linux: $ rsar -1svAmt iris.dat
Selected features: {1,2,3} |--> {5}, gamma = 1
Selected features: {1,2,3,4} |--> {5}, gamma = 1
Selected features: {1,2,4} |--> {5}, gamma = 1
Selected features: {1,3,4} |--> {5}, gamma = 1
Selected features: {2,3,4} |--> {5}, gamma = 1
Passes through the dataset: 16
Total bytes read: 34336
Total tokens (values) read: 11054
Total lines (samples) read: 2400
User time elapsed: 0.010 seconds (1 jiffies)
System time elapsed: 0.000 seconds (0 jiffies)
Time elapsed: 0.013 seconds
Maximum memory size (RSS): 294912 bytes
Integral memory allocated: 185968 bytes
|
|
Joe Halliwell <joeh@dai.ed.ac.uk> has shown (by counter-example) that rsar is not guaranteed to yield a minimal reduct. Certain datasets form local optima that cause the algorithm to miss the minimal reduct (though it still arrives at a reduct). It has been argued that, in practice, the yielded reduct is close enough to a minimal reduct for all intents and purposes. The exhaustive search (--all-reducts) does not honour the --gamma-disregard option, although it could, and should. This is the result of a bug in the current implementation of the exhaustive search algorithm. The semantics and operation of the --gamma-target option are overly complicated. See restrictions below. There is always one more bug. |
|
Please note that this restriction has been removed in the present version of rsar. The discussion is left here because it may be of service to users interested in additional speed of operation. The default behaviour of the optimised gamma evaluator is to deal with string values in the dataset. However, the default behaviour of the naive gamma evaluator (see --naive) is still to read in integers! To process a dataset with discrete, symbolic values, specify -nS or --naive --string-tokens. When dealing with integer values, you will need to massage datasets to an integer format before they can be accepted by rsar. This restriction was placed for a reason: it accelerates reading the dataset from disk, speeds up its parsing into tokens, drastically simplifies the data structures needed to store the tokens (dynamic, 32-bit integer arrays instead of hashes) and thus increases overall processing speed. Converting a dataset to rsars’s required format needs to be done once, not once per dataset pass. The dataset cannot be a device special, FIFO or pipe. It needs to be a file on a random-access device, as rsar makes numerous passes through the file in the course of reduction. This is yet another conscious design decision to facilitate the processing of extremely large datasets (several tens of thousands of attributes by several hundreds of objects). The operating system should be much more effective (and faster) at caching the dataset in main memory during the course of reduction. |
|
Written by Alexios Chouchoulas. |
|
Report bugs to Alexios Chouchoulas <alexios@dai.ed.ac.uk>. |
|
Copyright © 1999-2003 Alexios Chouchoulas
<alexios@dai.ed.ac.uk>. |