Lab 09: Ames Home Prices

Scenario: You work for an online real estate listing aggregator, perhaps similar to a well-known company whose name rhymes with Pillow. You are tasked with predicting the sale value of homes given features of the home such as size, number of bathrooms, etc. Users of the website can use your predictions to evaluate actual list prices if they are buying. If they are selling, they can use your predictions when deciding at what price to list their home.

You are not required to write up a “complete” report for this lab! It’s been a long semester and you’ve already written many reports! Simply submit the (blank) template to Canvas for full points.

Goal

The goal of this lab is to develop a model to predict the selling price of homes in Ames, Iowa, a regression task.

Data

The data (and task) for this lab originally comes from Kaggle.

You should not use the data from Kaggle; instead, use the data provided below. However, the descriptions of the variables found on Kaggle will be useful. (We will not repeat them here, as there are many!)

Data in Python

To load the data in Python, use:

import pandas as pd
ames_train = pd.read_csv("https://cs307.org/lab-09/data/ames-train.csv")
ames_test = pd.read_csv("https://cs307.org/lab-09/data/ames-test.csv")

Prepare Data for Machine Learning

Because the data is already split into train and test sets, we can simply create the X and y variants of each.

# create X and y for train dataset
X_train = ames_train.drop("SalePrice", axis=1)
y_train = ames_train["SalePrice"]

# create X and y for test dataset
X_test = ames_test.drop("SalePrice", axis=1)
y_test = ames_test["SalePrice"]

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
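As a starting point, a minimal sketch of this exploration might look like the following. (The specific statistics requested on PrairieLearn may differ; check there for exactly what to compute.)

```python
import pandas as pd

# load the training data as shown above
ames_train = pd.read_csv("https://cs307.org/lab-09/data/ames-train.csv")

# numeric summaries of the response variable; PrairieLearn specifies
# the exact statistics the autograder expects
summary = ames_train["SalePrice"].describe()
print(summary)
```

From here, a histogram of `SalePrice` (for example, via `ames_train["SalePrice"].hist()`) is a reasonable candidate visualization for the report.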

Models

For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:

  • Models must start from the given training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with, and the same variable names and types as, X_train.
    • We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation.
      • So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
  • Your model must have a fit method.
  • Your model must have a predict method that returns numbers.
  • Your model should be created with scikit-learn version 1.4.0 or newer.
  • Your model should be serialized with joblib version 1.3.2 or newer.
    • Your serialized model must be less than 5MB.
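One way to satisfy these rules is sketched below: a ColumnTransformer handles the heterogeneous (numeric and categorical) columns inside a Pipeline, which is then tuned with GridSearchCV. The particular estimator (Ridge) and parameter grid here are illustrative assumptions, not requirements; any technique seen in the course can work.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# load and split the data as shown above
ames_train = pd.read_csv("https://cs307.org/lab-09/data/ames-train.csv")
X_train = ames_train.drop("SalePrice", axis=1)
y_train = ames_train["SalePrice"]

# identify numeric and categorical columns by dtype
numeric_cols = X_train.select_dtypes(include="number").columns
categorical_cols = X_train.select_dtypes(exclude="number").columns

# preprocess each column type appropriately
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# full pipeline: preprocessing followed by an (illustrative) Ridge model
pipe = Pipeline([
    ("preprocess", preprocessor),
    ("model", Ridge()),
])

# cross-validated grid search; the grid here is a placeholder
param_grid = {"model__alpha": [0.1, 1.0, 10.0]}
mod = GridSearchCV(
    pipe,
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_percentage_error",
)
mod.fit(X_train, y_train)
```

Done correctly, the fitted `mod` object has both a `fit` and a `predict` method, and is exactly the kind of “model” you can serialize and submit to the autograder. (This sketch alone will likely not meet the MAPE thresholds below; it only shows the required structure.)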

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.

We will use MAPE (mean absolute percentage error) to assess your submitted model.

\[ \text{MAPE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right|}{\left| y_i \right|} \]
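This metric is available in scikit-learn, so you do not need to implement it by hand. A tiny worked example (with made-up prices) confirms how it behaves:

```python
from sklearn.metrics import mean_absolute_percentage_error

# hypothetical true and predicted sale prices
y_true = [100000, 200000, 150000]
y_pred = [110000, 190000, 150000]

# per-home percentage errors: 0.10, 0.05, 0.00 -> mean is 0.05
mape = mean_absolute_percentage_error(y_true, y_pred)
print(mape)  # 0.05
```

Note that MAPE weights errors relative to the true price, so a $10,000 miss hurts more on a cheap home than on an expensive one.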

To obtain the maximum points via the autograder, your model performance must meet or exceed:

Test MAPE: 0.087
Production MAPE: 0.087

The production data mimics data that will be passed to your model after it is put into production, that is, once it is being used for the stated goal within the scenario of the lab. As such, you do not have access to it. You do, however, have access to the test data.

Model Persistence

To save your models for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects.

Because the models for this lab could be quite large, consider using the compress parameter to the dump function!
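A minimal sketch of saving and reloading a model is shown below. The filename "model.joblib" is a placeholder; use the filename PrairieLearn specifies. The small LinearRegression here stands in for your fitted GridSearchCV object.

```python
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# stand-in for your fitted model (use your GridSearchCV object instead)
mod = LinearRegression().fit([[1], [2], [3]], [1, 2, 3])

# compress shrinks the saved file, which helps stay under the 5MB limit;
# "model.joblib" is a placeholder filename
dump(mod, "model.joblib", compress=3)

# sanity check: reload and predict before submitting
reloaded = load("model.joblib")
print(reloaded.predict([[4]]))  # [4.]
```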

Discussion

You do not need to write a discussion for this lab! Remember, you can just submit the template notebook for full points!

If you were writing a discussion, some things that you should think about include…

  • Are there any variables that obviously should not be included?
  • Would you put this model into practice today? (We hope not!)
    • Would you put this model into practice in 2011? (Maybe!)
    • Where would it be appropriate to use this model? (Ames, Iowa!)
  • Even though we used MAPE, which measures percentage rather than absolute error, does your model perform equally well for low-cost and high-cost homes?

Template Notebook

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy page!

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.