I have installed the following programs in the directory ~echown/courses/370/c4.
c4.5
A decision tree learning program.
c4.5rules
A program for converting decision trees into rules. (Not running yet.)
c4.5
c4.5 requires three files to run. The name of each file must begin with a file stem, which I will denote as stem. The files have different extensions:
stem.data is the training data file, with one line for each training example. Each line has the form
f1, f2, f3, ..., f16, class.
where f1, f2, ... are the features describing the example, and class is the correct class.
stem.test is the test data file. It has the same format as the training data file.
stem.names gives the name and legal values for each feature. It also gives the names of the various classes.
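As a rough illustration of the line format described above, the following Python sketch splits one line of a .data or .test file into its feature values and its class. The function name parse_example and the sample line are hypothetical. The sketch assumes that values are separated by commas and that a line may end with a period; it does not read the .names file or check that the values are legal.

```python
def parse_example(line):
    """Split one comma-separated example line into (features, label)."""
    # Drop surrounding whitespace and an optional trailing period.
    text = line.strip()
    if text.endswith("."):
        text = text[:-1]
    fields = [field.strip() for field in text.split(",")]
    if len(fields) < 2:
        raise ValueError("expected at least one feature and a class")
    # The last field is the class; everything before it is a feature.
    return fields[:-1], fields[-1]


# Hypothetical example line with four features and a class.
features, label = parse_example("sunny, 85, high, false, dont_play.")
print(features, label)
```

Applying the same function to every line of stem.test gives the true classes needed to check a classifier's predictions by hand.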
To run c4.5, you give the following Linux command:
c4.5 -f stem -u > stem.log
This will first read the stem.names file and then read in all of the training examples in stem.data. It will then analyze all of these examples and construct (and then prune) a decision tree. Finally, it will test the resulting tree on the examples stored in stem.test. All output will be written to the file stem.log. It will also create two files, stem.trees and stem.unpruned, which contain the pruned and unpruned decision trees in binary format.
C4.5 summarizes its results in a table of the following form:
Evaluation on training data (4000 items):

     Before Pruning           After Pruning
    ----------------   ---------------------------
    Size      Errors   Size      Errors   Estimate

    1085  496(12.4%)    873  546(13.7%)    (26.9%)   <<

Evaluation on test data (4000 items):

     Before Pruning           After Pruning
    ----------------   ---------------------------
    Size      Errors   Size      Errors   Estimate

    1085 1232(30.8%)    873 1206(30.1%)    (26.9%)   <<

Most of this should be self-explanatory. The "Size" column gives the number of nodes in the decision tree. The "Errors" column gives the number (and percentage) of examples that are misclassified. The "Estimate" column gives the predicted error rate for new examples (this is the so-called "pessimistic" estimate, and it is computed internally by the tree algorithm). In this case, we see that the unpruned decision tree had 1,085 nodes and made 496 errors on the training data and 1,232 errors (or 30.8%) on the test data. Pruning made the tree significantly smaller (only 873 nodes) and, while it hurt performance on the training data, it slightly improved performance on the test data. The pessimistic estimate (26.9%) was actually a bit optimistic, but not too far off the mark (30.1%).
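The percentages in the table above are just error counts divided by the number of items. The following Python sketch recomputes them from the counts quoted above; the helper name error_rate is hypothetical. The recomputed values agree with the table to within rounding (for example, 546 errors out of 4000 is 13.65%, shown as 13.7%).

```python
def error_rate(errors, items):
    """Return the misclassification rate as a percentage."""
    return 100.0 * errors / items


ITEMS = 4000  # size of both the training set and the test set in the table

# (description, number of errors) pairs copied from the table.
results = [
    ("training, before pruning", 496),
    ("training, after pruning", 546),
    ("test, before pruning", 1232),
    ("test, after pruning", 1206),
]

for description, errors in results:
    print(f"{description}: {error_rate(errors, ITEMS):.2f}%")

# Gap between the pruned tree's test error and the pessimistic estimate.
gap = error_rate(1206, ITEMS) - 26.9
print(f"estimate was low by about {gap:.2f} percentage points")
```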
C4.5 also prints out a confusion matrix that has one row and one column for every class. The number shown in row i, column j is the number of examples that were classified into class i but whose true class was j. A perfect confusion matrix has nonzero entries along the diagonal only.
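To make the row and column convention concrete, here is a minimal Python sketch that builds a confusion matrix from two lists of labels, using the convention stated above (row = assigned class, column = true class). The function names and the sample labels are hypothetical, and this is not the code C4.5 itself uses; when reading real output, check the header line of the printed matrix to confirm which axis is which.

```python
def confusion_matrix(predicted, actual, classes):
    """Count examples by (assigned class, true class).

    matrix[i][j] is the number of examples assigned to classes[i]
    whose true class is classes[j].
    """
    index = {name: position for position, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for assigned, true in zip(predicted, actual):
        matrix[index[assigned]][index[true]] += 1
    return matrix


def count_errors(matrix):
    """Sum every off-diagonal entry, i.e. every misclassified example."""
    return sum(
        count
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
        if i != j
    )


# Hypothetical labels for six examples and two classes.
classes = ["yes", "no"]
predicted = ["yes", "yes", "no", "no", "yes", "no"]
actual = ["yes", "no", "no", "no", "yes", "yes"]

matrix = confusion_matrix(predicted, actual, classes)
for name, row in zip(classes, matrix):
    print(name, row)
print("errors:", count_errors(matrix))
```

Summing the off-diagonal entries gives the same error count that appears in the "Errors" column of the summary table.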
c4.5rules
c4.5rules can be run to convert the decision tree into a set of rules. To execute the program, use the following command line:
c4.5rules -f stem -u >> stem.log
C4.5rules will read the stem.names, stem.data and stem.unpruned files and append its output to the file stem.log. It will evaluate its rules on the examples in stem.test. This program can be quite slow.
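If you prefer to drive both programs from a script, the two command lines above can be wrapped as in the following Python sketch. The function names build_commands and run_pipeline are hypothetical, the sketch assumes that c4.5 and c4.5rules are on your PATH, and it does not check exit codes. Only the -f and -u options shown above are used.

```python
import subprocess


def build_commands(stem):
    """Return the argument lists for the two commands given in the text."""
    return [
        ["c4.5", "-f", stem, "-u"],
        ["c4.5rules", "-f", stem, "-u"],
    ]


def run_pipeline(stem):
    """Run c4.5, then c4.5rules, collecting all output in <stem>.log."""
    log_path = stem + ".log"
    # Opening in "w" mode mirrors ">" for the first command; the second
    # command writes through the same handle, so its output follows the
    # first command's output, which mirrors ">>".
    with open(log_path, "w") as log:
        for command in build_commands(stem):
            subprocess.run(command, stdout=log)
    return log_path


# Show the commands without running them.
print(build_commands("stem"))
```

Calling run_pipeline("stem") in the directory that holds stem.names, stem.data and stem.test would then leave the combined output in stem.log.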
C4.5rules displays all of the rules and then summarizes the rule performance in the following table:
Evaluation on training data (548 items):

Rule  Size  Error  Used  Wrong           Advantage
----  ----  -----  ----  -----           ---------
  23     7  16.7%     4      0 (0.0%)      4 (4|0)     S
  12     3  27.8%    16      4 (25.0%)     9 (12|3)    A
   4     3  27.0%    35      9 (25.7%)    20 (26|6)    B
  18     4  19.6%   395     73 (18.5%)     0 (0|0)     X
  76     4  15.4%    11      1 (9.1%)      0 (0|0)     C
  81     5  25.0%    18      4 (22.2%)     0 (0|0)     C
  13     3  14.3%     5      0 (0.0%)      0 (0|0)     D

Tested 548, errors 133 (24.3%)

Evaluation on test data (548 items):

Rule  Size  Error  Used  Wrong           Advantage
----  ----  -----  ----  -----           ---------
  23     7  16.7%     3      3 (100.0%)   -2 (0|2)     S
  12     3  27.8%    10      8 (80.0%)    -1 (2|3)     A
   4     3  27.0%    35     16 (45.7%)    12 (19|7)    B
  18     4  19.6%   409    110 (26.9%)     0 (0|0)     X
  76     4  15.4%     7      2 (28.6%)     0 (0|0)     C
  81     5  25.0%    15      4 (26.7%)     0 (0|0)     C
  13     3  14.3%     2      0 (0.0%)      0 (0|0)     D

Tested 548, errors 174 (31.8%)
The columns have the following meaning.
In the same directory given above, I have placed some sample data files for C4.5.