CMSC 25000 - Artificial Intelligence
Winter 2004
Homework #4: Due February 3, 2004
Goals
Through this assignment you will:
- Explore the use of simulated evolution to search complex spaces for optimal solutions.
- Analyze the effect of different evolution strategies and fitness measures on genetic algorithm success.
- Learn how to formulate search problems as genetic algorithms.
- Build memory-based learning systems.
- Explore the issues that arise in such systems including feature selection and handling of missing features.
Background
Genetic algorithms build on a biological analogy
as a mechanism for local search and learning.
The inspiration lies in Darwin's theory of evolution by natural selection.
In natural evolution, individuals have traits carried by genes on chromosomes
that are passed on to offspring. Variation in genetic make-up arises
through mutation (copying errors) and through crossover, which combines
material from different chromosomes. The fittest individuals survive to
have more offspring.
Genetic algorithms map the structure of the search problem onto
evolutionary analogs. The features or parameters of the search space
map to genes and are grouped into chromosomes. Mutation corresponds
to random changes in some number of feature values. Crossover likewise
corresponds to swapping feature values for some subset of genes
between a pair of chromosomes. Finally, one must define a measure
of fitness. We explored the standard, rank, and rank-space
methods of computing fitness as the probability of an individual's survival
to the next generation. These measures draw on assessments of quality,
diversity, or both.
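To make these operators concrete, here is a minimal sketch in Scheme,
assuming a chromosome is represented as a list of gene values and an
MIT-Scheme-style (random n). The names and representation here are
illustrative, not ga.scm's actual interface.

  ;; Replace one randomly chosen gene with a random value from alphabet.
  (define (mutate chromosome alphabet)
    (let ((i (random (length chromosome))))
      (let loop ((genes chromosome) (k 0))
        (if (null? genes)
            '()
            (cons (if (= k i)
                      (list-ref alphabet (random (length alphabet)))
                      (car genes))
                  (loop (cdr genes) (+ k 1)))))))

  ;; Swap the tails of two chromosomes at a random cut point,
  ;; yielding one offspring.
  (define (crossover a b)
    (let loop ((a a) (b b) (cut (random (length a))))
      (if (= cut 0)
          b
          (cons (car a) (loop (cdr a) (cdr b) (- cut 1))))))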
Problem 1: Mastermind
Mastermind is a game in which one player creates a pattern of
colored pegs, which they hide from the other player. The other
player's goal is to guess the hidden pattern of pegs based on
limited feedback (number of correct colors, number of correctly
positioned pegs). Here we will use a GA to evolve a solution to
the Mastermind matching game.
In this variant of the game, the pegs will take on the colors of
the rainbow: red, orange, yellow, green, blue, indigo, violet.
Since we have all the processing power of a computer to solve
this problem, we will try to guess a longer-than-typical string,
20 pegs in length: "rgybvgbiyoroggviiybo".
You will find the code to implement a genetic algorithm and
some simple examples in ga.scm.
Running the GA involves setting up the problem and then
calling (run-ga 10 10 1000), where the numbers
specify the number of trials, the population size, and the maximum
number of generations.
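Before formulating your own version in Part A, it may help to see one
possible (purely illustrative) encoding: a chromosome as a list of 20
color symbols, with quality counting the pegs that match the hidden
pattern in both color and position. The actual setup should follow the
strings example in ga.scm.

  ;; Illustrative sketch only, not ga.scm's interface.
  (define colors '(r o y g b i v))

  (define hidden '(r g y b v g b i y o r o g g v i i y b o))

  ;; Quality: number of positions where candidate agrees with hidden.
  (define (quality candidate)
    (let loop ((c candidate) (h hidden) (score 0))
      (if (null? c)
          score
          (loop (cdr c) (cdr h)
                (if (eq? (car c) (car h)) (+ score 1) score)))))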
Part A
Formulate Mastermind as a GA. In particular, specify the
mapping to genes and chromosomes, and describe a quality measure.
Part B
Implement your genetic algorithm formulation of Mastermind.
Use the strings example as a model. You will have to modify
the letters and alphabet. Run the GA to solve Mastermind.
Part C
The default behavior of the GA uses the standard method to
determine fitness. An implementation of the rank method is
also available. Rerun your Mastermind GA with the rank
method of fitness. Compare the results.
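As background for the comparison, recall how the rank method assigns
survival probabilities: candidates are sorted by quality, and the i-th
best survives with probability p*(1-p)^(i-1), with the worst candidate
taking all remaining probability so the list sums to one. Here is a
sketch (illustrative, not ga.scm's implementation; p is a tunable
constant).

  ;; Survival probabilities for n ranked candidates under the rank method.
  ;; Example: (rank-probabilities 4 0.6) => (0.6 0.24 0.096 0.064)
  (define (rank-probabilities n p)
    (let loop ((i 1) (remaining 1))
      (if (= i n)
          (list remaining)
          (cons (* p remaining)
                (loop (+ i 1) (* remaining (- 1 p)))))))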
Background
Previous homeworks have explored the representation and manipulation of
knowledge to produce inferences. Using mechanisms such as rule-based systems
and constraint satisfaction, we have seen how knowledge about the world
and about specific problems can be encoded in different ways. Further,
we have seen how the choice of knowledge representation shapes the ways
in which that knowledge can be manipulated. However, all these techniques
share the fundamental requirement that the knowledge be captured manually
by some knowledge engineer to enable later automatic
inference.
This homework begins our exploration of general techniques that can
use the regularities in observed data to acquire functions from
input vectors to output values or classifications. In other words,
we will explore systems that can automatically learn from examples.
We begin with two well-known and widely used learning techniques
that learn functions from labeled training data: nearest neighbor
and identification trees.
Problem 2
Picking Moves
You've studied both alpha-beta minimax game play and nearest neighbor
learning. An impatient friend suggests that all that search is
simply too complicated and unnecessary since there are many on-line
game servers that record the moves of a game. Your friend suggests
that rather than doing alpha-beta you should simply download all the
games from a server and use a memory-based technique: for any given
game position (e.g., a board position in a board game like chess, or a
hand of cards in a card game), find a position in the server's collection
that matches the current position exactly, and then make the move that
the player made in the recorded game.
You argue that things, unfortunately, aren't that simple. Give
three problems with your friend's approach
to game-playing. Please keep your answers brief (2-3 lines each).
Problem 3
Picking Concentrations
NOTE: Yes, this looks a lot like the
decision tree homework. The problem set-up is the same, but
the task is to use nearest neighbor this time.
You've been told many times how important it is to pick the
right concentration for your interests and skills - and, of
course, your future employability. After many visits to
guidance counselors with varying degrees of success and after
studying some machine learning techniques in CMSC 25000, you
conclude that there should be some way to distill their knowledge
into an automatic process. After all, gaining experience is
about seeing many examples and determining what works and what
doesn't.
You decide to take a first pass at designing such a system to
predict the concentration of a student based on their performance
in some preliminary classes. You decide to focus on a few classes
of concentrations first, rather than trying to make fine-grained
decisions. You want to route students to one of the following four
areas: physical sciences, biological sciences, business,
and humanities. You use grades from courses typically taken in the
first few terms, broken down into five groups:
- Math: Calculus I, Calculus II, Discrete Math
- Science: Physics I, Chemistry I, Biology I
- Languages: French I, French II, Spanish I
- Business: Micro-economics, Macro-economics, Marketing
- Humanities: Modern European History, English Composition, Music
Part A
For a nearest neighbor classifier, would you recommend using
number grades or letter grades? Briefly explain
your answer.
Part B
The basic code for nearest neighbor is in nearest.scm.
Some training samples are in newdata2.trn,
and test data is in newdata2.tst.
However, you remember that one of the key issues in nearest neighbor
learning is the distance metric. How else would you know what's "nearest"?
You decide that a basic weighted sum of squares distance measure
is the best choice for this task.
Write and turn in the code for your distance measure by filling in the
necessary code in (distance2 u n).
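For reference, a minimal sketch of such a measure follows. It assumes
u and n are equal-length lists of numeric grades and uses a
hypothetical weights list; the argument convention inside nearest.scm
may differ.

  ;; Hypothetical per-feature weights; length must match the grade vectors.
  (define weights '(1 1 1 1 1))

  ;; Weighted sum of squared differences between grade vectors u and n.
  (define (distance2 u n)
    (let loop ((u u) (n n) (w weights) (sum 0))
      (if (null? u)
          sum
          (let ((d (- (car u) (car n))))
            (loop (cdr u) (cdr n) (cdr w)
                  (+ sum (* (car w) (* d d))))))))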
Part C
You realize that it's quite likely that many students seeking guidance
may not have completed all of the courses on which you are basing your
assessment. However, you'd still like to be able to make use of
the classifier that you have developed. Modify the distance metric
you developed in Part B to handle the case where some of the values
in the input vector of class values are missing. Assume that
a missing grade will be represented by -1.
Test your revised function on newdata2-missing.tst.
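One simple approach (an assumption, not a prescription) is to skip any
feature whose test value is the missing marker and sum over the rest.
Note that this makes test vectors with more missing grades look
artificially close to everything, an issue Part D revisits.

  ;; Sketch: ignore features marked -1 in the test vector u.
  (define (distance2 u n)
    (let loop ((u u) (n n) (w weights) (sum 0))
      (cond ((null? u) sum)
            ((= (car u) -1)   ; grade missing in test vector: skip it
             (loop (cdr u) (cdr n) (cdr w) sum))
            (else
             (let ((d (- (car u) (car n))))
               (loop (cdr u) (cdr n) (cdr w)
                     (+ sum (* (car w) (* d d)))))))))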
Part D
Finally, you conclude that in order to obtain as much training data
as possible you would also like to be able to learn from student
cases where they have not necessarily taken all of the classes you
have identified as input features.
Now, modify the distance metric you developed above to accept
both training examples and test instances with missing features.
Furthermore, the metric should remain comparable even when different
numbers of features are available for comparison.
Test your revised function by training on newdata2-missing.trn and testing on newdata2-missing.tst.
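One way to meet the consistency requirement (again an assumption, not
the only design) is to skip a feature whenever it is missing in either
vector and then divide by the number of features actually compared,
yielding a mean weighted squared difference:

  ;; Sketch: average over the features present in both vectors.
  (define (distance2 u n)
    (let loop ((u u) (n n) (w weights) (sum 0) (count 0))
      (cond ((null? u)
             (if (= count 0)
                 1e10          ; no shared features: treat as very far
                 (/ sum count)))
            ((or (= (car u) -1) (= (car n) -1))
             (loop (cdr u) (cdr n) (cdr w) sum count))
            (else
             (let ((d (- (car u) (car n))))
               (loop (cdr u) (cdr n) (cdr w)
                     (+ sum (* (car w) (* d d)))
                     (+ count 1)))))))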
Part E
You are concerned that your classifier is over-sensitive to
individual training examples. You remember that there
is a variation of nearest neighbor that incorporates information
from more than a single closest neighbor. Modify the nearest
neighbor code to perform k-nearest neighbor classification.
Turn in your results for k=3 and k=5.
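For reference, here is a minimal sketch of the voting step; it is
illustrative, since the real change belongs inside nearest.scm's
classification code. It assumes each training example is a pair of a
feature list and a label, and an MIT-Scheme-style (sort list predicate).

  (define (take lst n)
    (if (or (= n 0) (null? lst))
        '()
        (cons (car lst) (take (cdr lst) (- n 1)))))

  (define (count-of x lst)
    (if (null? lst)
        0
        (+ (if (eq? x (car lst)) 1 0) (count-of x (cdr lst)))))

  ;; Classify test by majority vote among its k nearest training examples.
  (define (knn-classify k training test)
    (let* ((scored (map (lambda (ex)
                          (cons (distance2 test (car ex)) (cdr ex)))
                        training))
           (nearest (take (sort scored (lambda (a b) (< (car a) (car b)))) k))
           (labels (map cdr nearest)))
      (let vote ((ls labels) (best #f) (best-count 0))
        (if (null? ls)
            best
            (let ((c (count-of (car ls) labels)))
              (if (> c best-count)
                  (vote (cdr ls) (car ls) c)
                  (vote (cdr ls) best best-count)))))))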