CMSC 25000 - Artificial Intelligence
Winter 2004
Homework #4: Due February 3, 2004
Goals
Through this assignment you will:
- Explore the use of simulated evolution to search complex spaces for optimal solutions.
- Analyze the effect of different evolution strategies and fitness measures on genetic algorithm success.
- Learn how to formulate search problems as genetic algorithms.
- Build memory-based learning systems.
- Explore the issues that arise in such systems including feature selection and handling of missing features.
Background
Genetic algorithms build on a biological analogy
as a mechanism for local search and learning.
The inspiration lies in Darwin's theory of evolution by natural selection.
In natural evolution, individuals have traits carried by genes on chromosomes
that are passed on to offspring. Variation in genetic make-up arises
through mutation (copying errors) and through crossover, which combines
material from different chromosomes. The fittest individuals survive to
have more offspring.
Genetic algorithms map the structure of the search problem onto
evolutionary analogs. The features or parameters of the search space
map to genes and are grouped into chromosomes. Mutation corresponds
to random changes in some number of feature values. Crossover likewise
corresponds to swapping feature values for some subset of genes
between a pair of chromosomes. Finally, one must define a measure
of fitness. We explored the standard, rank, and rank-space
methods of computing fitness as the probability of an individual's survival
to the next generation. These measures draw on assessments of quality,
diversity, or both.
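To make these operators concrete, here is a minimal sketch in Scheme,
assuming a chromosome is represented as a list of gene values and an
MIT-Scheme-style (random n). The names and representation here are
illustrative, not ga.scm's actual interface.

  ;; Replace one randomly chosen gene with a random value from alphabet.
  (define (mutate chromosome alphabet)
    (let ((i (random (length chromosome))))
      (let loop ((genes chromosome) (k 0))
        (if (null? genes)
            '()
            (cons (if (= k i)
                      (list-ref alphabet (random (length alphabet)))
                      (car genes))
                  (loop (cdr genes) (+ k 1)))))))

  ;; Swap the tails of two chromosomes at a random cut point,
  ;; yielding one offspring.
  (define (crossover a b)
    (let loop ((a a) (b b) (cut (random (length a))))
      (if (= cut 0)
          b
          (cons (car a) (loop (cdr a) (cdr b) (- cut 1))))))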
Problem 1: Mastermind
Mastermind is a game in which one player creates a pattern of
colored pegs, which they hide from the other player. The other
player's goal is to guess the hidden pattern of pegs based on
limited feedback (number of correct colors, number of correctly
positioned pegs). Here we will use a GA to evolve a solution to
the Mastermind matching game.
In this variant of the game, the pegs will take on the colors of
the rainbow: red, orange, yellow, green, blue, indigo, violet.
Since we have all the processing power of a computer to solve
this problem, we will try to guess a longer-than-typical string,
20 pegs in length: "rgybvgbiyoroggviiybo".
You will find the code to implement a genetic algorithm and
some simple examples in ga.scm.
Running the GA involves setting up the problem and then
calling (run-ga 10 10 1000), where the numbers
specify the number of trials, the population size, and the maximum
number of generations.
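Before formulating your own version in Part A, it may help to see one
possible (purely illustrative) encoding: a chromosome as a list of 20
color symbols, with quality counting the pegs that match the hidden
pattern in both color and position. The actual setup should follow the
strings example in ga.scm.

  ;; Illustrative sketch only, not ga.scm's interface.
  (define colors '(r o y g b i v))

  (define hidden '(r g y b v g b i y o r o g g v i i y b o))

  ;; Quality: number of positions where candidate agrees with hidden.
  (define (quality candidate)
    (let loop ((c candidate) (h hidden) (score 0))
      (if (null? c)
          score
          (loop (cdr c) (cdr h)
                (if (eq? (car c) (car h)) (+ score 1) score)))))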
Part A
Formulate Mastermind as a GA. In particular, specify the
mapping to genes and chromosomes, and describe a quality measure.
Part B
Implement your genetic algorithm formulation of Mastermind.
Use the strings example as a model. You will have to modify
the letters and alphabet. Run the GA to solve Mastermind.
Part C
The default behavior of the GA uses the standard method to
determine fitness. An implementation of the rank method is
also available. Rerun your Mastermind GA with the rank
method of fitness. Compare the results.
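As background for the comparison, recall how the rank method assigns
survival probabilities: candidates are sorted by quality, and the i-th
best survives with probability p*(1-p)^(i-1), with the worst candidate
taking all remaining probability so the list sums to one. Here is a
sketch (illustrative, not ga.scm's implementation; p is a tunable
constant).

  ;; Survival probabilities for n ranked candidates under the rank method.
  ;; Example: (rank-probabilities 4 0.6) => (0.6 0.24 0.096 0.064)
  (define (rank-probabilities n p)
    (let loop ((i 1) (remaining 1))
      (if (= i n)
          (list remaining)
          (cons (* p remaining)
                (loop (+ i 1) (* remaining (- 1 p)))))))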
Background
Previous homeworks have explored the representation and manipulation of
knowledge to produce inferences. Using mechanisms such as rule-based systems
and constraint satisfaction, we have seen how knowledge about the world
and about specific problems can be encoded in different ways. Further,
we have seen how the choice of knowledge representation shapes the ways
in which that knowledge can be manipulated. However, all these techniques
share the fundamental requirement that the knowledge be captured manually
by some knowledge engineer to enable later automatic
inference.
This homework begins our exploration of general techniques that can
use the regularities in observed data to acquire functions from
input vectors to output values or classifications. In other words,
we will explore systems that can automatically learn from examples.
We begin with two well-known and widely used learning techniques
that learn functions from labeled training data: nearest neighbor
and identification trees.
Problem 2
Picking Moves
You've studied both alpha-beta minimax game play and nearest neighbor
learning. An impatient friend suggests that all that search is
simply too complicated and unnecessary since there are many on-line
game servers that record the moves of a game. Your friend suggests
that rather than doing alpha-beta you should simply download all the
games from a server and use a memory-based technique: for any given
game position (e.g., a board position in a board game like chess, or a
hand of cards in a card game), find a position in the server's collection
that matches the current position exactly, and then make the move that
the player made in the recorded game.
You argue that things, unfortunately, aren't that simple. Give
three problems with your friend's approach
to game-playing. Please keep your answers brief (2-3 lines each).
Problem 3
Picking Concentrations
NOTE: Yes, this looks a lot like the
decision tree homework. The problem set-up is the same, but
the task is to use nearest neighbor this time.
You've been told many times how important it is to pick the
right concentration for your interests and skills - and, of
course, your future employability. After many visits to
guidance counselors with varying degrees of success and after
studying some machine learning techniques in CMSC 25000, you
conclude that there should be some way to distill their knowledge
into an automatic process. After all, gaining experience is
about seeing many examples and determining what works and what
doesn't.
You decide to take a first pass at designing such a system to
predict the concentration of a student based on their performance
in some preliminary classes. You decide to focus on a few classes
of concentrations first, rather than trying to make fine-grained
decisions. You want to route students to one of the following four
areas: physical sciences, biological sciences, business,
and humanities. You use grades from courses typically taken in the
first few terms, broken down into five groups:
- Math: Calculus I, Calculus II, Discrete Math
- Science: Physics I, Chemistry I, Biology I
- Languages: French I, French II, Spanish I
- Business: Micro-economics, Macro-economics, Marketing
- Humanities: Modern European History, English Composition, Music
Part A
For a nearest neighbor classifier, would you recommend using
number grades or letter grades? Briefly explain
your answer.
Part B
The basic code for nearest neighbor is in nearest.scm.
Some training samples are in newdata2.trn,
and test data is in newdata2.tst.
However, you remember that one of the key issues in nearest neighbor
learning is the distance metric. How else would you know what's "nearest"?
You decide that a basic weighted sum of squares distance measure
is the best choice for this task.
Write and turn in the code for your distance measure by filling in the
necessary code in (distance2 u n).
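For reference, a minimal sketch of such a measure follows. It assumes
u and n are equal-length lists of numeric grades and uses a
hypothetical weights list; the argument convention inside nearest.scm
may differ.

  ;; Hypothetical per-feature weights; length must match the grade vectors.
  (define weights '(1 1 1 1 1))

  ;; Weighted sum of squared differences between grade vectors u and n.
  (define (distance2 u n)
    (let loop ((u u) (n n) (w weights) (sum 0))
      (if (null? u)
          sum
          (let ((d (- (car u) (car n))))
            (loop (cdr u) (cdr n) (cdr w)
                  (+ sum (* (car w) (* d d))))))))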
Part C
You realize that it's quite likely that many students seeking guidance
may not have completed all of the courses on which you are basing your
assessment. However, you'd still like to be able to make use of
the classifier that you have developed. Modify the distance metric
you developed in Part B to handle the case where some of the values
in the input vector of class values are missing. Assume that
a missing grade will be represented by -1.
Test your revised function on newdata2-missing.tst.
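One simple approach (an assumption, not a prescription) is to skip any
feature whose test value is the missing marker and sum over the rest.
Note that this makes test vectors with more missing grades look
artificially close to everything, an issue Part D revisits.

  ;; Sketch: ignore features marked -1 in the test vector u.
  (define (distance2 u n)
    (let loop ((u u) (n n) (w weights) (sum 0))
      (cond ((null? u) sum)
            ((= (car u) -1)   ; grade missing in test vector: skip it
             (loop (cdr u) (cdr n) (cdr w) sum))
            (else
             (let ((d (- (car u) (car n))))
               (loop (cdr u) (cdr n) (cdr w)
                     (+ sum (* (car w) (* d d)))))))))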
Part D
Finally, you conclude that in order to obtain as much training data
as possible you would also like to be able to learn from student
cases where they have not necessarily taken all of the classes you
have identified as input features.
Now, modify the distance metric you developed above to accept
both training examples and test instances with missing features.
Furthermore, the metric should remain comparable even when different
numbers of features are available for comparison.
Test your revised function by training on newdata2-missing.trn and testing on newdata2-missing.tst.
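One way to meet the consistency requirement (again an assumption, not
the only design) is to skip a feature whenever it is missing in either
vector and then divide by the number of features actually compared,
yielding a mean weighted squared difference:

  ;; Sketch: average over the features present in both vectors.
  (define (distance2 u n)
    (let loop ((u u) (n n) (w weights) (sum 0) (count 0))
      (cond ((null? u)
             (if (= count 0)
                 1e10          ; no shared features: treat as very far
                 (/ sum count)))
            ((or (= (car u) -1) (= (car n) -1))
             (loop (cdr u) (cdr n) (cdr w) sum count))
            (else
             (let ((d (- (car u) (car n))))
               (loop (cdr u) (cdr n) (cdr w)
                     (+ sum (* (car w) (* d d)))
                     (+ count 1)))))))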
Part E
You are concerned that your classifier is over-sensitive to
individual training examples. You remember that there
is a variation of nearest neighbor that incorporates information
from more than a single closest neighbor. Modify the nearest
neighbor code to perform k-nearest neighbor classification.
Turn in your results for k=3 and k=5.
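For reference, here is a minimal sketch of the voting step; it is
illustrative, since the real change belongs inside nearest.scm's
classification code. It assumes each training example is a pair of a
feature list and a label, and an MIT-Scheme-style (sort list predicate).

  (define (take lst n)
    (if (or (= n 0) (null? lst))
        '()
        (cons (car lst) (take (cdr lst) (- n 1)))))

  (define (count-of x lst)
    (if (null? lst)
        0
        (+ (if (eq? x (car lst)) 1 0) (count-of x (cdr lst)))))

  ;; Classify test by majority vote among its k nearest training examples.
  (define (knn-classify k training test)
    (let* ((scored (map (lambda (ex)
                          (cons (distance2 test (car ex)) (cdr ex)))
                        training))
           (nearest (take (sort scored (lambda (a b) (< (car a) (car b)))) k))
           (labels (map cdr nearest)))
      (let vote ((ls labels) (best #f) (best-count 0))
        (if (null? ls)
            best
            (let ((c (count-of (car ls) labels)))
              (if (> c best-count)
                  (vote (cdr ls) (car ls) c)
                  (vote (cdr ls) best best-count)))))))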