CMSC 25000 - Artificial Intelligence
Winter 2004
Homework #5: Due February 17, 2004
Goals
Through this assignment you will:
- Build decision tree systems.
- Consider the differences between decision trees and the other
classifiers studied so far.
Problem 1
Picking Concentrations: Redux
Note: The classification task in this homework
is similar (well, okay, identical) to that in the nearest neighbor
homework. This similarity will allow you to compare the
two techniques more directly.
You've been told many times how important it is to pick the
right concentration for your interests and skills - and, of
course, your future employability. After many visits to
guidance counselors with varying degrees of success and after
studying some machine learning techniques in CMSC25000, you
conclude that there should be some way to distill their knowledge
into an automatic process. After all, gaining experience is
about seeing many examples and determining what works and what
doesn't.
You decide to take a first pass at designing such a system to
predict a student's concentration based on their performance
in some preliminary classes. You focus on a few broad classes
of concentrations first, rather than trying to make fine-grained
distinctions. You want to route students to one of the following
four areas: physical sciences, biological sciences, business,
and humanities. You use grades from courses typically taken in the
first few terms, grouped into five subject groups:
- Math: Calculus I, Calculus II, Discrete Math
- Science: Physics I, Chemistry I, Biology I
- Languages: French I, French II, Spanish I
- Business: Micro-economics, Macro-economics, Marketing
- Humanities: Modern European History, English Composition, Music
Decision Trees
You decide to experiment with decision tree classifiers, which
your instructor claimed are less sensitive to irrelevant
attributes and even select the most useful features to
incorporate into the classification process.
Note: Please take careful note of the data structure used to
encode the training samples. It will have a major impact on
the code you write. Errors here can lead to 14-hour runtimes
that produce incorrect output!
Part A
The basic code for training and prediction using identification
trees is in id3.scm. As before, however, a key
component has not yet been implemented - the function that
computes the average disorder (entropy) over the different
branches of a proposed test. You should train and test on
the newdata2.trn and newdata2.tst samples from the
nearest-neighbor homework.
Write and hand in the code that implements this function.
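For reference, the average disorder of a proposed test is the
entropy of each branch, weighted by the fraction of samples that
reach it:

    D = \sum_b \frac{n_b}{N}
        \left( -\sum_c \frac{n_{b,c}}{n_b} \log_2 \frac{n_{b,c}}{n_b} \right)

where N is the total number of samples at the node, n_b is the
number of samples reaching branch b, and n_{b,c} is the number of
those belonging to class c.

As a rough guide to the arithmetic, here is a minimal sketch in
Scheme. It assumes each branch is summarized as a list of
per-class sample counts (e.g., ((4 0) (2 2)) for two branches
over two classes); the actual representation in id3.scm may
differ, so adapt it to the data structures there.

    (define (log2 x) (/ (log x) (log 2)))

    ;; Entropy of a single branch, given its per-class counts.
    (define (branch-disorder counts)
      (let ((total (apply + counts)))
        (apply +
               (map (lambda (n)
                      (if (zero? n)
                          0  ; treat 0 log 0 as 0
                          (let ((p (/ n total)))
                            (- (* p (log2 p))))))
                    counts))))

    ;; Average disorder: each branch's entropy weighted by the
    ;; fraction of samples that reach that branch.
    (define (average-disorder branches)
      (let ((grand-total
             (apply + (map (lambda (b) (apply + b)) branches))))
        (apply +
               (map (lambda (b)
                      (* (/ (apply + b) grand-total)
                         (branch-disorder b)))
                    branches))))

    ;; Example: (average-disorder '((4 0) (2 2))) => 0.5
    ;; The pure branch contributes 0; the mixed branch contributes
    ;; 1 bit, weighted by the half of the samples it receives.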
Part B
In decision trees, which type of feature would be more
effective for representing grades - numeric grades or letter
grades - or would either be acceptable? Briefly
explain your answer.
Part C
The function (show-tree) displays the structure
of the decision tree. Examine the tree produced by training on
the original data set.
Which features are relevant, and which are irrelevant?
Part D
You've heard that one of the advantages of using decision
trees is that they provide a perspicuous interpretation as
a set of if-then rules.
Produce a list of rules consistent with the structure of
the identification tree above, limited to the classifications
with the highest confidence (based on the number of samples
leading to each classification). In other words, you need only
produce rules for paths that classify more than 200 samples.
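For concreteness, a rule read off a single root-to-leaf path
might look like the following (the features, values, and sample
count here are purely hypothetical - use whatever actually
appears in your tree):

    IF Science = A AND Math = A
    THEN physical sciences    (covers 250 samples)

Since this hypothetical path covers more than 200 samples, its
rule would be included; rules for paths covering fewer samples
may be omitted.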