Lab 09: Ames Home Prices
Scenario: You work for an online real estate listing aggregator, perhaps similar to a well known company whose name rhymes with Pillow. You are tasked with predicting the sale value of homes given features of the home such as size, number of bathrooms, etc. Users of the website can use your predictions to evaluate actual list prices if they are buying. If they are selling, they can use your predictions when considering at what price to list their home.
Goal
The goal of this lab is to develop a model to predict the selling price of homes in Ames, Iowa, a regression task.
Data
The data (and task) for this lab originally comes from Kaggle.
Do not use the Kaggle data; use the data provided below instead. However, the descriptions of the variables found on Kaggle will be useful. (We will not repeat them here as there are many!)
Data in Python
To load the data in Python, use:

import pandas as pd

ames_train = pd.read_csv("https://cs307.org/lab-09/data/ames-train.csv")
ames_test = pd.read_csv("https://cs307.org/lab-09/data/ames-test.csv")
Prepare Data for Machine Learning
Because the data is already train-test split, we can simply create the X and y variants of the data.
# create X and y for train dataset
X_train = ames_train.drop("SalePrice", axis=1)
y_train = ames_train["SalePrice"]

# create X and y for test dataset
X_test = ames_test.drop("SalePrice", axis=1)
y_test = ames_test["SalePrice"]
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
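As a minimal sketch of this step, the snippet below computes a couple of summary statistics for the response. It uses a tiny made-up frame in place of the real ames_train (the column names and values here are illustrative, not from the lab data); with the real data you would also call .describe() and make a plot for your report.

```python
import pandas as pd

# toy stand-in for ames_train; the real frame comes from the CSV loaded above
ames_train = pd.DataFrame({
    "Gr Liv Area": [856, 1262, 1786, 961],
    "SalePrice": [105000, 172000, 244000, 142000],
})

# example summary statistics for the response variable
mean_price = ames_train["SalePrice"].mean()
median_price = ames_train["SalePrice"].median()
print(mean_price, median_price)

# a fuller numeric summary of every column
print(ames_train.describe())
```

With the real data, a scatter plot of SalePrice against a size variable (for example via ames_train.plot.scatter) is a reasonable visualization to include.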
Models
For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:
- Models must start from the given training data, unmodified.
  - Importantly, the types and shapes of X_train and y_train should not be changed.
  - In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a compatible shape with, and the same variable names and types as, X_train.
- We assume that you will use a Pipeline and GridSearchCV from sklearn, as you will need to deal with heterogeneous data, and you should be using cross-validation.
  - So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a "model" that you can submit to the autograder.
- Your model must have a fit method.
- Your model must have a predict method that returns numbers.
- Your model should be created with scikit-learn version 1.4.0 or newer.
- Your model should be serialized with joblib version 1.3.2 or newer.
- Your serialized model must be less than 5MB.
We will use MAPE (mean absolute percentage error) to assess your submitted model.
\[ \text{MAPE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right|}{\left| y_i \right|} \]
To obtain the maximum points via the autograder, your model performance must meet or exceed:
Test MAPE: 0.087
Production MAPE: 0.087
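A direct translation of the MAPE formula above, should you want to check your model locally before submitting (the function name and the example numbers are ours, for illustration only):

```python
import numpy as np

def mape(y, y_hat):
    """Mean absolute percentage error, matching the formula above."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(y - y_hat) / np.abs(y))

# each prediction is off by 10% of the true value, so MAPE is 0.10
err = mape([100000, 200000], [110000, 180000])
print(err)
```

scikit-learn also ships an equivalent metric, sklearn.metrics.mean_absolute_percentage_error, which you may prefer.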
Model Persistence
To save your models for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects.
Discussion
You do not need to write a discussion for this lab! Remember, you can just submit the template notebook for full points!
If you were writing a discussion, some things that you should think about include…
- Are there any variables that obviously should not be included?
- Would you put this model into practice today? (We hope not!)
- Would you put this model into practice in 2011? (Maybe!)
- Where would it be appropriate to use this model? (Ames, Iowa!)
- Even though we used MAPE, and even considering percentage error, does your model perform equally well for low-cost and high-cost homes?
Template Notebook
Submission
On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.