Linear Regression¶
Due: Friday, Nov 16 at 4pm
In this assignment, you will fit linear regression models and implement a few simple variable selection algorithms. The assignment will give you experience with NumPy and more practice with using classes and functions to support code reuse.
You must work alone on this assignment.
At the heart of the assignment is a table, where each column is a variable and each row is a sample unit. As an example, in a health study, each sample unit might be a person, with variables like height, weight, sex, etc. In your analysis, you will build models that, with varying levels of accuracy, can predict the value of one of the variables as a function of the others.
Predictions are only possible if variables are related somehow. As an example, look at this plot of recorded crimes against logged complaint calls about garbage to 311.
Each point describes a sample unit, which in this example represents a geographical region of Chicago. Each region is associated with variables, such as the number of crimes or complaint calls during a fixed time frame. Given this plot, if you got the question of how many crimes you think were recorded for a region that had 150 complaint calls about garbage, you would follow the general trend and probably say something like 3000 recorded crimes. To formalize this prediction, we need a model for the data that relates a dependent variable (e.g., crimes) to a set of predictor variables (e.g., complaint calls). We will assume a linear dependence as follows:
or using vector notation
We have introduced the following notation:
- \(n\)
- The sample unit that we are considering out of \(N\) total.
- \(y_n\)
- An observation of the dependent variable, e.g., the total number of crimes.
- \(x_{nk}\)
- An observation of a predictor variable \(k\) for sample unit \(n,\) e.g., the complaint calls about garbage. We assume a total of \(K\) different predictors.
- \(\beta = (\beta_0, \beta_1, \dots, \beta_K)^T\)
- A column vector of the regression coefficients, where \(\beta_0\) is the intercept and \(\beta_k\) is the coefficient associated with the \(k\mathrm{th}\) predictor. This vector describes the red line in the figure above. Note that a positive value of a coefficient suggests a positive correlation with the dependent variable. The same is true for a negative value and a negative correlation.
- \(\varepsilon_n\)
- Not all sample unit points lie on the red line, which means the equation \(y_n = \mathbf{x}_n^T \beta\) is rarely satisfied. In our model, this discrepancy is assumed to be caused by random error. We assume that the random error is zero-mean and drawn from a normal distribution of a fixed variance. We want as much as possible to be explained deterministically, which means we will favor values of \(\beta\) that result in small errors.
- \(\mathbf{x}_n = (1, x_{n1}, \dots, x_{nK})^T\)
- A column vector representation of all the predictors for a given sample unit. As a convenience, a 1 has been prepended to the vector to incorporate the intercept \(\beta_0\) into the vector notation.
When making predictions, what we are really doing is asking what is the weighted average value for a given set of predictors. Since we have assumed the errors have zero mean, to make a prediction, we just remove the error term as:
This equation can be written for several sample units at the same time as:
where \(\mathbf{\hat y}\) is a column vector of predictions and \(\mathbf{X}\) is an \(N \times K\) matrix where each row is one sample unit.
Model fitting¶
There are many possible candidates for \(\beta,\) some that fit the data
better than others. Finding the best value of \(\beta\) is referred to as
fitting the model. To simplify your job, we have provided code for fitting a
linear model. The function linear_regression
finds a value \(\beta\)
that minimizes \(\mathbf{\hat y} - \mathbf{y}\) in a least-squared sense.
That is, it finds values for \(\beta\) such that the predicted values
\(\mathbf{\hat y} = \mathbf{X} \beta\) are as close to the observed
variables as possible (in a statistically motivated way using maximum
likelihood). This function accepts a two-dimensional NumPy array of floats
containing the predictor variables (X
) and a one-dimensional NumPy array of
floats (y
) containing the dependent variable as input and returns
\(\beta.\) Note that X
does not correspond exactly with
\(\mathbf{X},\) since it should not have the column of ones (we will add
that for you).
We have also provided a function, apply_beta(beta, X)
for making
predictions using the model. It takes beta
, the array generated by
our linear regression function, and X
, a two-dimensional NumPy
array of floats that contains values for the predictor variables
(again, without the added column of 1’s), applies the function
described by beta
to each row in X
, and returns a
one-dimensional NumPy array of the resulting values.
Here are sample calls to linear_regression
and apply_beta
:
>>> import numpy as np
>>> from util import *
>>> X = np.array([[5,2], [3,2], [6,2.1], [7,3]]) # predictors
>>> y = np.array([5,2,6,6]) # dependent
>>> beta = linear_regression(X, y)
>>> beta
array([ 1.20104895, 1.41083916, -1.6958042 ]) # f(x) = 1.2 + 1.4*x1 - 1.7*x2
>>> apply_beta(beta, X)
array([ 4.86363636, 2.04195804, 6.1048951 , 5.98951049]) # predictions
The input to both of these functions require X
to be
two-dimensional, even in the case of K = 1. NumPy makes a
distinction between one-dimensional and two-dimensional arrays, even
if they contain the same information. For instance, if we want to
extract the second column of X
as a one-dimensional array, we do
the following:
>>> X
array([[ 5. , 2. ],
[ 3. , 2. ],
[ 6. , 2.1],
[ 7. , 3. ]])
>>> X[:,1]
array([ 2. , 2. , 2.1, 3. ])
>>> X[:,1].shape
(4,)
The resulting shape will not be accepted by linear_regression
and
apply_beta
. To retrieve a 2D column subset of X
, you can use a
list of integers as the index. This mechanism keeps X
two-dimensional, even if the list of indices has only one value:
>>> X[:,[1]]
array([[ 2. ],
[ 2. ],
[ 2.1],
[ 3. ]])
>>> X[:,[1]].shape
(4, 1)
In general, you can specify a slice containing a specific subset of
columns as a list. For example, let Z
be:
>>> Z = np.array([[1, 2, 3, 4], [11, 12, 13, 14], [21, 22, 23, 24]])
>>> Z
array([[ 1, 2, 3, 4],
[11, 12, 13, 14],
[21, 22, 23, 24]])
Evaluating the expression Z[:, [0,2,3]]
will yield a new 2D array
with columns 0, 2, and 3 from the array Z
:
>>> Z[:, [0, 2, 3]]
array([[ 1, 3, 4],
[11, 13, 14],
[21, 23, 24]])
or more generally, we can specify any expression for the slice that yields a list:
>>> l = [0, 2, 3]
>>> Z[:, l]
array([[ 1, 3, 4],
[11, 13, 14],
[21, 23, 24]])
Note: ignore the FutureWarning raised by invoking NumPy’s linear regression function.
Data¶
In this assignment you will write code that can be used on many different datasets. We have provided a couple of sample datasets for testing purposes.
City: | Predicting crime rate from the number of complaint calls to 311. |
---|---|
Stock: | Predicting a stock index from a set of stocks. |
More information about each dataset can be found by clicking the links above.
For this assignment, a dataset is stored in a directory that contains two files: a CSV (comma-separated values) file and a JSON file. The CSV file contains a table where each column corresponds to a variable and each row corresponds to a sample unit. The first row contains the column names. The JSON file contains a few parameters:
name
: the name of the dataset,predictor_vars
: the column indices of the predictor variables,dependent_var
: the column index of the dependent variable,training_fraction
: the fraction of the data that should be used to train models, andseed
: a random number generator seed.
The last two parameters will be used to help split the data into two sets: one that you will use to fit or train models and one that you will use to evaluate how well different models predict outcomes on new-to-the-model data. We describe this process below.
Getting started¶
We have seeded your repository with a directory for this assignment.
To pick it up, change to your capp30121-aut-18-username
directory
(where the string username
should be replaced with your username)
and then run the command: git pull upstream master
. You should
also run git pull
to make sure your local copy of your repository
is in sync with the server.
The pa5
directory contains the following files:
regression.py : | Python file where you will write your code. |
---|---|
util.py : | Python file with several helper functions, some of which you will need to use in your code. |
output.py : | This file is described in detail in the “Testing your code” section below. |
test_regression.py : | |
Python file with the automated tests for this assignment. |
The pa5
directory also contains a data
directory which, in turn,
contains two sub-directories: city
and stock
.
We will be using the scikit-learn machine learning library in this assignment. It is installed on the department’s Linux machines, but not on the class VM. To install it on your VM, run:
sudo -H pip3 install scipy sklearn
You will be asked for a password. Use the student account password:
uccs
.
Task 0: The DataSet
and Model
classes¶
In this assignment, you will be working with datasets and models,
so it will make sense for us to have a DataSet
class and a
Model
class, which you will implement in regression.py
(we have provided basic skeletons for those classes in that file)
This task describes the requirements for the DataSet
and
Model
classes and, while you should give some thought to the design
of these classes before moving on to the rest of the assignment,
it is very likely you will have to reassess your design throughout
the assignment (and you should not get discouraged if you find
yourself doing this). We suggest you start by writing a first
draft of the DataSet
and Model
classes, and then
fine-tune them as you work through the rest of the assignment.
Representing Datasets
Recall that each dataset has two files: a CSV file with the data to be
used for building our models and a JSON file with the
parameters for the assignment. In the file util.py
, we have
provided two functions: one to read the sample data from the CSV file
(load_numpy_array
)
and one to read the parameters from the JSON file (load_json_file
).
The DataSet
class will be very simple, and should only require a constructor
that takes the name of the directory holding the dataset as its only
parameter. In addition to saving the values necessary for the
computation from the JSON and CSV files (dataset name, predictor
variable indices, dependent variable index, and the column labels),
your constructor should also split the sample data into two NumPy
arrays: one, called the training data, that you will use to construct
models in Tasks #1-#4, and another, called the testing data,
that you will use in Task #5 to evaluate how well different models
predict new-to-the-model data.
Why not use all the data to fit the models? It is easy to fit a model that does a good job of predicting the dependent variable for the sample units that were used to train it, but does a terrible job of predicting the dependent variable for other data. How do we determine how well a model does with new-to-it data? We can hold out some of the data and use the model to predict the values of the dependent variable for the held-out sample units. Since we know the actual value of the dependent variable for this held-out data, we can use it to evaluate the effectiveness of the model on previously-unseen data.
We will use the train_test_split function from the Scikit-learn
model selection library to make the split. This method takes a NumPy
array and splits it into two arrays: one to use for training models
and another to use for testing them. It has two keyword parameters that
can be used to control the size of each array: train_size
and
test_size
. We’ll set train_size
using the
training_fraction
parameter from the parameters file and we will
set the test_size
parameter to None
. Using None
for the
test_size
parameter indicates that the rows not chosen for the
training set should be included in the test set, which is the default
behavior. Note that you could omit this parameter, but the function
generates an annoying warning message if you do.
The train_test_split
method decides randomly which rows to include
in which set. To support repeatability and simplify testing, it
includes a random_state
parameter that can be used to set the seed
for the underlying random process. Like previous assignments,
this will ensure that, from a given seed, the random process
will select the same rows to
be assigned to the same sets each time you run your program.
We’ll set this parameter using the
seed
parameter from the JSON file.
Your implementation of the DataSet
constructor should be quite
simple. Ours is less than 10 lines of code. The heavy lifting is
done in the functions we provide in util.py
, as well as the train_test_split
method from Scikit-learn.
Representing models
Implementing a Model
class will allow
us to write functions that take Model
objects as parameters,
and which can return Model
objects as well, instead of having
to pass around lists or dictionaries with the information for
a given model.
Each model will be constructed from a dataset and a subset of the dataset’s predictor variables. We encourage you to think carefully about what attributes and methods to include in your class.
Take into account that, for the tests to work, your Model
class
must have at least the following public attributes (don’t worry
if you don’t understand what these mean right now; the will become
clearer in subsequent tasks):
R2
: The value of R2 for the modeladj_R2
: The value of the Adjusted R2 for the modeldep_var
: The index of the dependent variable.pred_vars
: A list of integers containing the indices of the predictor variables.beta
: A numpy array with the model’s \(\beta\)
As we will describe later, it can also be very helpful to include
a __repr__
method.
Task 1: Model evaluation using the Coefficient of Determination (R2)¶
Blindly regressing on all available predictors is bad practice and can lead to over-fit models. Over-fitting happens when your model picks up on random structure in the training data that does not represent the underlying phenomenon. This situation is made even worse if some of your predictors are highly correlated (see multicollinearity), since it can ruin the interpretation of your coefficients. As an example, consider a study that looks at the effects of smoking and drinking on developing lung cancer. Smoking and drinking are highly correlated, so the regression can arbitrarily pick up either as the explanation for high prevalence of lung cancer. As a result, drinking could be assigned a high coefficient while smoking a low one, leading to the questionable conclusion that drinking causes lung cancer while smoking does not.
This example illustrates the importance of variable selection. In order to compare two different models (subsets of predictors) against each other, we need a measurement of their goodness of fit, i.e. how well they explain the data. A popular choice is the coefficient of determination, R2, which first considers the variance of the dependent variable, \(\mathrm{Var}(y)\):
In this example we revisit the regression of the total number of crimes, y, on the number of calls about garbage to 311, x. To the right, a histogram over y is shown and its variance is calculated. The variance is a measure of how spread out the data is. It is calculated as follows:
where \(\bar y\) denotes the mean of all y’s.
We now subtract the red line from each point. These new points are called the residuals and represent the deviations from the expected values under our model. If the variance of the residuals, \(\mathrm{Var}(y - \hat y),\) is zero, it means that all residuals had to be zero and thus all sample units lie perfectly on the fitted line. On the other hand, if the variance of the residuals is the same as the variance of y, it means that the model did not explain anything away and we can consider the model useless. These two extremes represent R2 values of 1 and 0, respectively. All R2 values in between these two extremes will be cases where the variance was reduced some, but not entirely. Our example is one such case:
As we have hinted, the coefficient of determination, R2, measures how much the variance was reduced by subtracting the model. More specifically, it calculates this reduction as a percentage of the original variance:
which finally becomes:
In these steps we omitted the normalization constants N since they cancel each other out. We also did not subtract the mean when calculating the variance of the residuals, since it can be shown mathematically that if the model has been fit by least squares, the sum of the residuals is always zero.
There are two properties of R2 that are good to keep in mind when checking the correctness of your calculations:
- 0 ≤ R2 ≤ 1
- If model A contains a superset of predictors of model B, then model A cannot have a lower R2 value than B.
These properties are only true if R2 is computed using the same data that was used to train the model parameters. If you calculate R2 on a held-out testing set, the R2 value could decrease with more predictors (due to over-fitting) and R2 is no longer guaranteed to be greater than or equal to zero. The intuition behind R2 would also need to be reconsidered, but since the final formula in (1) remains the same we will omit the details here.
A good model should explain as much of the variance as possible, so we will be favoring higher R2 values. However, since using all predictors gives the highest R2, we must balance this goal with a desire to use few predictors. In general it is important to be cautious and not blindly interpret R2; take a look at these two generated examples:
To the left, we see that the fitted model is close to the one we used to generate the data. However, the variance of the error is still high, so the R2 is rather low. To the right is a model that clearly is not right for the data, but still manages to record a high R2 value. This example also shows that that linear regression is not always the right model choice. However, in many cases, a simple transformation of the predictors or the dependent variable can resolve this problem, making the linear regression paradigm still relevant.
In this task, you will do the following:
Task 1a¶
You must implement this function in regression.py
:
def compute_single_var_models(dataset):
'''
Computes all the single-variable models for a dataset
Inputs:
dataset: (DataSet object) a dataset
Returns:
List of Model objects, each representing a single-variable model
'''
More specifically, given a dataset with P predictor variables, you will construct P univariate (i.e., one variable) models of the dataset’s dependent variable, one for each of the possible predictor variables, and compute their associated R2 values. Note that when we ask for the R2 value for a model, we mean the value obtained by computing R2 using the model’s \(\beta\) on the data that was used to train it.
Most of the work in this task will be in your Model
class. In fact,
if you’ve designed Model
correctly, your implementation of
compute_single_var_models
shouldn’t be longer
than three lines long (ours is actually just one line long).
Also, notice how this function (and the next one we’ll ask you to implement)
doesn’t return R2. That value should be an attribute of your Model
,
so there’s no need to return it as an additional return value.
Note
You may not use the NumPy var
method to compute R2 for
two reasons: (1) our computation uses the biased variance while
the var
method computes the unbiased variance and will lead
you to generate the wrong answers and (2) we want you to get
practice working with NumPy arrays.
Task 1b¶
You must implement this function in regression.py
:
def compute_all_vars_model(dataset):
'''
Computes a model that uses all the predictor variables in the dataset
Inputs:
dataset: (DataSet object) a dataset
Returns:
A Model object that uses all the predictor variables
'''
More specifically, you will construct a single model that uses all of the dataset’s predictor variables to predict the dependent variable. According to the second property of R2, the R2 for this model should be the largest R2 value you will see for the training data.
At this point, you should make sure that your code to calculate a model’s R2 value is general enough to handle models of with multiple predictor variables (i.e. multivariate models).
Take into account that, if you have a good design for Model
,
then implementing Task 1b should be trivial after implementing Task 1a.
If it doesn’t feel that way, ask on Piazza or come to office hours
so we can give you some quick feedback on your design.
Testing your code¶
As usual, you will be able to test your code from IPython and by using py.test
.
When using IPython, make sure to enable autoreload before importing the regression.py
module:
In [1]: %load_ext autoreload
In [2]: %autoreload 2
In [3]: import regression
You could then test your Task 1a code by creating a DataSet
object (let’s assume
we create a dataset
variable for this) and run this:
In [3]: univar_models = compute_single_var_models(dataset)
univar_models
should then contain a list of Model
objects. However, checking
the R2 values manually from IPython can be cumbersome, so we have included
a Python program, output.py
, that will print out these (and other) values
for all of the tasks. You can run it from the command-line like this:
$ python3 output.py data/city
If your implementation of Task 1a is correct, you will see this:
City Task 1a
!!! You haven't implemented the Model __str__ method !!!
R2: 0.1402749161031348
!!! You haven't implemented the Model __str__ method !!!
R2: 0.6229070858532733
!!! You haven't implemented the Model __str__ method !!!
R2: 0.5575360783921093
!!! You haven't implemented the Model __str__ method !!!
R2: 0.7831498392992615
!!! You haven't implemented the Model __str__ method !!!
R2: 0.7198560514392482
!!! You haven't implemented the Model __str__ method !!!
R2: 0.32659079486818354
!!! You haven't implemented the Model __str__ method !!!
R2: 0.6897288976957778
You should be producing the same R2 values shown above, but we can’t tell
what model they each correspond to! For output.py
to be actually useful, you
will need to implement the __str__
method in Model
.
We suggest your string representation be in the following form: the name of the
dependent variable followed by a tilde (~
) and the regression
equation with the constant first. Please print no more than six decimal places.
If you do this, the output for Task 1a will now look like this:
City Task 1a
CRIME_TOTALS ~ 575.687669 + 0.678349 * GRAFFITI
R2: 0.1402749161031348
CRIME_TOTALS ~ -22.208880 + 5.375417 * POT_HOLES
R2: 0.6229070858532733
CRIME_TOTALS ~ 227.414583 + 7.711958 * RODENTS
R2: 0.5575360783921093
CRIME_TOTALS ~ 11.553128 + 18.892669 * GARBAGE
R2: 0.7831498392992615
CRIME_TOTALS ~ -65.954319 + 13.447459 * STREET_LIGHTS
R2: 0.7198560514392482
CRIME_TOTALS ~ 297.222082 + 10.324616 * TREE_DEBRIS
R2: 0.32659079486818354
CRIME_TOTALS ~ 308.489056 + 10.338500 * ABANDONED_BUILDINGS
R2: 0.6897288976957778
For example, the first model uses only one variable, GRAFFITI
to predict crime,
and has an R2 of only 0.14027. The last model uses only ABANDONED_BUILDINGS
, and
the higher R2 (0.6897) tells us that ABANDONED_BUILDINGS
is likely a better
predictor than GRAFFITI
.
Take into account that you can also run output.py
with the Stock dataset:
$ python3 output.py data/stock
The full expected output of output.py
can be found in in the
:City: and :Stock: pages. However, please note that you do not need to
check all this output manually. We have also included automated tests
that will do these checks for you. For example, in Task 1 you can
run the following:
$ py.test -vk task1
No hard-coding!
Since we only have two datasets, it can be very easy in some tasks to write code that will pass the tests by hard-coding the expected values in the function (and returning one or the other depending on the dataset that is being used)
** If you do this, you will receive ZERO POINTS on your entire test score, regardless of whether other tasks are implemented correctly without hard-coding **
Task 2: Building and selecting bivariate models¶
If you look at the output for Task 1a, none of the predictor variables individually are particularly good predictors for the dependent variable (they all had low R2 values compared to when using all predictors). In this and subsequent tasks, we will construct better models using multiple variables without using all of them and risking over-fitting.
For example, we could predict crime using not only complaints about garbage, but graffiti as well. For example, we may want to find \(\beta = (\beta_0, \beta_\mathrm{garbage}, \beta_\mathrm{graffiti})\) in the equation
where \(f(\mathbf{x})\) is a prediction of the number of crimes given the number of complaint calls about garbage \((x_\mathrm{garbage})\) and graffiti \((x_\mathrm{graffiti}).\)
For this task, you will test all possible bivariate models (K = 2) and determine the one with the highest R2 value. We suggest that you use two nested for-loops to iterate over all combinations of two predictors, calculate the R2 value for each combination and keep track of one with the highest R2 value.
To do this task, you will implement this function in regression.py
:
def compute_best_pair(dataset):
'''
Find the bivariate model with the best R2 value
Inputs:
dataset: (DataSet object) a dataset
Returns:
A Model object for the best bivariate model
'''
Unlike the functions in Task 1, you can expect to write more code in this
function. In particular, you should not include the code to loop over
all possible bivariate models within your Model
class; it should instead
be in this function.
You can test this function by running this:
$ py.test -vk task2
You should also look at the output produced by output.py
. How does the
bivariate model compare to the single-variable models in Task 1? Which pair
of variables perform best together? Does this result make sense given the
results from Task 1? We don’t need you to answer these questions anywhere:
they’re just food for thought!
Task 3: Building models of arbitrary complexity¶
How do we determine how many and which variables will generate the best model? It turns out that this question is unsolved and is of interest to both computer scientists and statisticians. To find the best model of three variables we could repeat the process in Task 2 with three nested for-loops instead of two. For K variables we would have to use K nested for-loops. Nesting for-loops quickly becomes computationally expensive and infeasible for even the most powerful computers. Keep in mind that models in genetics research easily reach into the thousands of variables. How then do we find optimal models of several variables?
We’ll approach this problem by using heuristics, which will give us an approximate (as opposed to an exact) result. We are going to split this task into two pieces: (1) determine the best K variables to use for a fixed K and (2) determine the best value for K.
How do we determine which K variables will yield the best model for a fixed K? As noted, the naive approach, test all possible combinations as in Task 2, is intractable for large K. An alternative simpler approach is to choose the K variables with the K highest R2 values in the table from Task 1. This approach does not work well if your predictors are correlated. As you’ve already seen from the first two tasks, neither of the top two individual predictors for crime (GARBAGE and STREET_LIGHTS) are included in the best bivariate model (CRIME_TOTALS ~ -36.151629 + 3.300180 * POT_HOLES + 7.129337 * ABANDONED_BUILDINGS).
Instead, you’ll implement a heuristic known as Backward elimination, which is an example of a greedy algorithm. Backward elimination starts with a set that contains all potential predictor variables and then repeatedly eliminates variables until the set contains K variables. At each step, the algorithm requires you to identify the variable in the model that when subtracted, yields the model with the best R2 value and eliminate it from the set.
To do this task, you will implement this function in regression.py
:
def backward_selection(dataset):
'''
Given a dataset with P predictor variables, uses backward selection to
compute the best models for every value of K between 1 and P.
Inputs:
dataset: (DataSet object) a dataset
Returns:
A list (of length P) of Model objects. The first element is the
model where K=1, the second element is the model where K=2, and so on.
'''
You can test this function by running this:
$ py.test -vk task3
Task 4: Selecting the best K¶
In Task 3, you computed a list of the best K-variable models for \(1\leq K \leq P\) using backward elimination. How do we choose among these models? The calculation of R2 does not take into account the number of variables in a model and so, while it is a valid way to choose among different models, each with K variables, it is not a good way to choose among different values for K. Adjusted R2, which does take the number of predictor variables in the models into account, is a reasonable alternative.
Essentially, adjusted R2 replaces the biased variances used in R2 with their unbiased counterparts.) Here’s a formulation for it that highlights the adjustment, rather than the change in the definition of variance:
where \(N\) is the sample size (i.e. number of rows in the training data) and \(K\) is the number of predictor variables in the model. So, we need to find the model with the best adjusted R2 value.
To do this task, you will implement this function in regression.py
:
def choose_best_model(dataset):
'''
Given a dataset, choose the best model produced
by backwards selection (i.e., the model with the highest
adjusted R2)
Inputs:
dataset: (DataSet object) a dataset
Returns:
A Model object
'''
Note that this function should not repeat the backwards selection algorithm; you should simply call the function from Task 3.
You can test this function by running this:
$ py.test -vk task4
Task 5: Training vs. test data¶
Please read this section very carefully
Until now, you have evaluated a model using the data that was used to train it. The resulting model may be quite good for that particular dataset, but it may not be particularly good at predicting novel data. This is the problem that we have been referring to as over-fitting.
Up to this point, we have only computed R2 values for the training data that was used to construct the model. It is also valid to compute an R2 value for a model when applied to other data. For this task, you will use the *testing data* to evaluate the model identified in Task 4. You will receive *zero credit* for this task if you train a new model using the testing data.
To do this task, you will implement this function in regression.py
:
def validate_model(dataset, model):
'''
Given a dataset and a model trained on the training data,
compute the R2 of applying that model to the testing data.
Inputs:
dataset: (DataSet object) a dataset
model: (Model object) A model that must have been trained
on the dataset's training data.
Returns:
(float) An R2 value
'''
If your Model
class is well designed, this function should not require
more than one line (or, at most, a few lines) to implement. If that is
not the case, please come to office hours or ask on Piazza so we can
give you some quick feedback on your design.
You can test this function by running this:
$ py.test -vk task5
Grading¶
Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style. (For more details on the categories, see our PA Rubric page.)
The exact weights for each category will vary from one assignment to another. For this assignment, the weights will be:
- Completeness: 50%
- Correctness: 15%
- Design: 20%
- Style: 15%
You must include header comments in all the methods you implement except any getters or setters that you happen to define.
Obtaining your test score¶
Like previous assignments, you can obtain your test score by running py.test
followed by ../common/grader.py
.
Continuous Integration¶
Continuous Integration (CI) is available for this assignment. For more details, please see our Continuous Integration page. We strongly encourage you to use CI in this and subsequent assignments.
Submission¶
To submit your assignment, make sure that you have:
- put your name at the top of your file,
- registered for the assignment using chisubmit,
- added, committed, and pushed
regression.py
to the git server, and - run the chisubmit submission command.
Here are the relevant commands to run on the Linux command-line.
(Remember to leave out the $
prompt when you type the command.)
$ chisubmit student assignment register pa5
$ git add regression.py
$ git commit -m"final version of PA #5 ready for submission"
$ git push
$ chisubmit student assignment submit pa5
Remember to push your code to the server early and often! Also, remember that you can submit as many times as you like before the deadline.
Acknowledgments: This assignment has been worked on by many people over the years, including Matthew Rocklin and Gustav Larsson.