I have installed the following programs in the directory ~echown/courses/370/c4.

Running c4.5

The program c4.5 requires three files to run. The name of each file must begin with a file stem, which I will denote as stem. The files have different extensions:

	stem.names	defines the classes and attributes
	stem.data	contains the training examples
	stem.test	contains the test examples

To run c4.5, use the following Linux command:

 c4.5 -f stem -u > stem.log 
This will first read the stem.names file and then read in all of the training examples in stem.data. It will then analyze all of these examples and construct (and then prune) a decision tree. Finally, it will test the resulting tree on the examples stored in stem.test. All output will be written on the file stem.log. It will also create two files stem.trees and stem.unpruned, which contain the pruned and unpruned decision trees in binary format.
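For reference, the input files are plain text. The following sketch writes out a tiny, hypothetical dataset (the stem "golf" and the attributes are made up) in the standard C4.5 file format: the .names file lists the class values first (ending with a period) followed by one attribute definition per line, and each line of the .data (or .test) file is one example as comma-separated attribute values ending with the class.

```python
# A hypothetical stem "golf" with two made-up attributes, written in
# the standard C4.5 input format for illustration.
names = (
    "play, dont_play.\n"                 # class values come first, ending in a period
    "\n"
    "outlook: sunny, overcast, rain.\n"  # a discrete attribute
    "humidity: continuous.\n"            # a continuous attribute
)

# One example per line: attribute values in order, then the class.
data = (
    "sunny, 85, dont_play\n"
    "overcast, 70, play\n"
    "rain, 96, dont_play\n"
)

with open("golf.names", "w") as f:
    f.write(names)
with open("golf.data", "w") as f:
    f.write(data)
```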

C4.5 summarizes its results in a table of the following form:

Evaluation on training data (4000 items):

	 Before Pruning           After Pruning
	----------------   ---------------------------
	Size      Errors   Size      Errors   Estimate

	1085  496(12.4%)    873  546(13.7%)    (26.9%)   <<

Evaluation on test data (4000 items):

	 Before Pruning           After Pruning
	----------------   ---------------------------
	Size      Errors   Size      Errors   Estimate

	1085  1232(30.8%)    873  1206(30.1%)    (26.9%)   <<
Most of this should be self-explanatory. The "Size" column gives the number of nodes in the decision tree. The "Errors" column gives the number (and percentage) of examples that are misclassified. The "Estimate" column gives the predicted error rate for new examples (this is the so-called "pessimistic" estimate, and it is computed internally by the tree algorithm). In this case, we see that the unpruned decision tree had 1,085 nodes and made 496 errors on the training data and 1,232 errors (or 30.8%) on the test data. Pruning made the tree significantly smaller (only 873 nodes) and, while it hurt performance on the training data, it slightly improved performance on the test data. The pessimistic estimate (26.9%) was actually a bit optimistic, but not too far off the mark (30.1%).
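The percentages in the "Errors" column are simply the error counts divided by the number of items; a quick check in Python:

```python
# Reproduce the error percentages from the tables above:
# errors / items, expressed as a percentage rounded to one decimal place.
def error_rate(errors, items):
    return round(100.0 * errors / items, 1)

print(error_rate(496, 4000))   # unpruned tree on training data -> 12.4
print(error_rate(1232, 4000))  # unpruned tree on test data     -> 30.8
```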

C4.5 also prints a confusion matrix with one row and one column for every class. The number shown in row i, column j is the number of examples that were classified into class i but whose true class was j. A perfect classifier yields a confusion matrix with nonzero entries only along the diagonal.
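To make the row/column convention concrete, here is a small sketch (not C4.5's own code) that builds a confusion matrix for a hypothetical two-class problem:

```python
# Build a confusion matrix: entry [i][j] counts the examples classified
# into class i whose true class was j. The classes and labels are made up.
classes = ["yes", "no"]
predicted = ["yes", "yes", "no", "no", "yes"]
true      = ["yes", "no",  "no", "no", "yes"]

index = {c: k for k, c in enumerate(classes)}
matrix = [[0] * len(classes) for _ in classes]
for p, t in zip(predicted, true):
    matrix[index[p]][index[t]] += 1

for row in matrix:
    print(row)
# A perfect classifier would put all counts on the diagonal.
```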


Running c4.5rules

After C4.5 has been run, the program c4.5rules can be run to convert the decision tree into a set of rules. To execute the program, use the following command line:
 c4.5rules -f stem -u >> stem.log 
C4.5rules will read the stem.names, stem.data and stem.unpruned files and append its output to the file stem.log. It will evaluate its rules on the examples in stem.test. This program can be quite slow.

C4.5rules displays all of the rules and then summarizes the rule performance in the following table:

Evaluation on training data (548 items):

Rule  Size  Error  Used  Wrong          Advantage
----  ----  -----  ----  -----          ---------
  23     7  16.7%     4      0  (0.0%)	     4  (4|0)   S
  12     3  27.8%    16      4  (25.0%)	     9  (12|3)  A 
   4     3  27.0%    35      9  (25.7%)	    20  (26|6)  B
  18     4  19.6%   395     73  (18.5%)	     0  (0|0)   X
  76     4  15.4%    11      1  (9.1%)	     0  (0|0)   C
  81     5  25.0%    18      4  (22.2%)	     0  (0|0)   C
  13     3  14.3%     5      0  (0.0%)	     0  (0|0)   D

Tested 548, errors 133 (24.3%)

Evaluation on test data (548 items):

Rule  Size  Error  Used  Wrong          Advantage
----  ----  -----  ----  -----          ---------
  23     7  16.7%     3      3  (100.0%)    -2  (0|2)   S
  12     3  27.8%    10      8  (80.0%)	    -1  (2|3)   A 
   4     3  27.0%    35     16  (45.7%)	    12  (19|7)  B
  18     4  19.6%   409    110  (26.9%)	     0  (0|0)   X
  76     4  15.4%     7      2  (28.6%)	     0  (0|0)   C
  81     5  25.0%    15      4  (26.7%)	     0  (0|0)   C
  13     3  14.3%     2      0  (0.0%)	     0  (0|0)   D

Tested 548, errors 174 (31.8%)

The columns have the following meaning. "Rule" is the rule's identifying number, and "Size" is the number of conditions in the rule. "Error" is the rule's estimated error rate. "Used" is the number of examples to which the rule was applied, and "Wrong" is the number (and percentage) of those that it misclassified. "Advantage" is the net benefit of retaining the rule: the pair (a|b) gives the number of examples the rule gets right (a) and wrong (b) relative to dropping it, so the advantage is a - b. The final letter identifies the rule's output class.

Here we see that the rules achieved an error rate of 31.8% on the test data. The rules are grouped according to their output classes. Furthermore, the classes are ordered. The rules are applied in the order given. If none of the rules applies to an example, then the example is assigned to the "default" class.
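The classification procedure described above can be sketched as follows. This is a simplified illustration with hypothetical rules (real rules come from c4.5rules, and may test thresholds on continuous attributes as well as equality):

```python
# Apply an ordered rule list to an example: the first rule whose
# conditions all hold determines the class; if no rule applies,
# fall back to the default class. Rules here are hypothetical
# (conditions, class) pairs using equality tests only.
def classify(example, rules, default_class):
    for conditions, cls in rules:
        if all(example.get(attr) == value for attr, value in conditions):
            return cls
    return default_class

rules = [
    ([("outlook", "overcast")], "play"),
    ([("outlook", "rain"), ("windy", "true")], "dont_play"),
]

print(classify({"outlook": "overcast"}, rules, "dont_play"))  # -> play
print(classify({"outlook": "sunny"}, rules, "dont_play"))     # no rule fires -> dont_play
```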

In the same directory given above, I have placed some sample data files for C4.5.


Eric Chown, echown@bowdoin.edu