CMSC 25000 - Artificial Intelligence
Winter 2004
Homework #5: Due February 17, 2004
Goals
Through this assignment you will:
- Build decision tree systems.
- Consider the differences between decision trees and the other
classifiers studied so far.
Problem 1
Picking Concentrations: Redux
Note: The classification task in this homework
is similar (well, okay, identical) to that in the nearest neighbor
homework. This similarity will allow you to compare the
two techniques more directly.
You've been told many times how important it is to pick the
right concentration for your interests and skills - and, of
course, your future employability. After many visits to
guidance counselors with varying degrees of success and after
studying some machine learning techniques in CMSC25000, you
conclude that there should be some way to distill their knowledge
into an automatic process. After all, gaining experience is
about seeing many examples and determining what works and what
doesn't.
You decide to take a first pass at designing such a system to
predict a student's concentration based on their performance
in some preliminary classes. You focus on a few broad classes
of concentrations first, rather than trying to make fine-grained
distinctions. You want to route students to one of the following
four areas: physical sciences, biological sciences, business,
and humanities. You use grades from courses typically taken in the
first few terms, grouped into five subject groups:
- Math: Calculus I, Calculus II, Discrete Math
- Science: Physics I, Chemistry I, Biology I
- Languages: French I, French II, Spanish I
- Business: Micro-economics, Macro-economics, Marketing
- Humanities: Modern European History, English Composition, Music
Decision Trees
You decide to experiment with decision tree classifiers, which
your instructor claimed are less sensitive to irrelevant
attributes and even select the most useful features to
incorporate into the classification process.
Note: Please take careful note of the data structure used to
encode the training samples. It will have a major impact on
the code you write. Errors here can lead to 14-hour runtimes
that produce incorrect output!
Part A
The basic code for training and prediction using identification
trees is in id3.scm. As before, however, a key
component has not yet been implemented - the function that
computes the average disorder (entropy) over the different
branches of a proposed test. You should train and test on
the newdata2.trn and newdata2.tst samples from the
nearest-neighbor homework.
Write and hand in the code that implements this function.
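For reference, the average disorder of a proposed test is the
entropy of each branch, weighted by the fraction of samples that
reach it:

    D = \sum_b \frac{n_b}{N}
        \left( -\sum_c \frac{n_{b,c}}{n_b} \log_2 \frac{n_{b,c}}{n_b} \right)

where N is the total number of samples at the node, n_b is the
number of samples reaching branch b, and n_{b,c} is the number of
those belonging to class c.

As a rough guide to the arithmetic, here is a minimal sketch in
Scheme. It assumes each branch is summarized as a list of
per-class sample counts (e.g., ((4 0) (2 2)) for two branches
over two classes); the actual representation in id3.scm may
differ, so adapt it to the data structures there.

    (define (log2 x) (/ (log x) (log 2)))

    ;; Entropy of a single branch, given its per-class counts.
    (define (branch-disorder counts)
      (let ((total (apply + counts)))
        (apply +
               (map (lambda (n)
                      (if (zero? n)
                          0  ; treat 0 log 0 as 0
                          (let ((p (/ n total)))
                            (- (* p (log2 p))))))
                    counts))))

    ;; Average disorder: each branch's entropy weighted by the
    ;; fraction of samples that reach that branch.
    (define (average-disorder branches)
      (let ((grand-total
             (apply + (map (lambda (b) (apply + b)) branches))))
        (apply +
               (map (lambda (b)
                      (* (/ (apply + b) grand-total)
                         (branch-disorder b)))
                    branches))))

    ;; Example: (average-disorder '((4 0) (2 2))) => 0.5
    ;; The pure branch contributes 0; the mixed branch contributes
    ;; 1 bit, weighted by the half of the samples it receives.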
Part B
In decision trees, which type of feature would be more
effective for representing grades - numeric grades or letter
grades - or would either be acceptable? Briefly
explain your answer.
Part C
The function (show-tree) displays the structure
of the decision tree. Examine the tree produced by training on
the original data set.
Which features are relevant, and which are irrelevant?
Part D
You've heard that one of the advantages of using decision
trees is that they provide a perspicuous interpretation as
a set of if-then rules.
Produce a list of rules consistent with the structure of
the identification tree above, limited to the classifications
with the highest confidence (based on the number of samples
leading to each classification). In other words, you need only
produce rules for paths that classify more than 200 samples.
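For concreteness, a rule read off a single root-to-leaf path
might look like the following (the features, values, and sample
count here are purely hypothetical - use whatever actually
appears in your tree):

    IF Science = A AND Math = A
    THEN physical sciences    (covers 250 samples)

Since this hypothetical path covers more than 200 samples, its
rule would be included; rules for paths covering fewer samples
may be omitted.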