Lab 04: Ames Home Prices

For Lab 04, you will use housing data to develop a model to predict the selling price of homes in Ames, Iowa.

You are not required to write up a “complete” report for this lab! This lab focuses on working with sklearn pipelines. Simply submit the (blank) template to Canvas for full points.

Background

People live in homes! The price of these homes is extremely important. For many buyers, a home will be the largest purchase of their life, often requiring a mortgage with a repayment schedule spread over 30 years.

Scenario and Goal

You work for an online real estate listing aggregator, perhaps similar to a well-known company whose name rhymes with Pillow. You are tasked with predicting the sale price of homes given features of the home such as size, number of bathrooms, etc. Buyers using the website can compare your predictions against actual list prices. Sellers can use your predictions when deciding at what price to list their home.

Data

To achieve the goal of this lab, we will need housing data from Ames, Iowa. The necessary data is provided in the following files:

  • ames-train.csv
  • ames-test.csv

Source

The data (and task) for this lab originally comes from Kaggle.

Do not use the Kaggle version of the data. Use the data provided in this lab instead.

Data Dictionary

Each observation in the train, test, and (hidden) production data contains information about a particular home in Ames, Iowa that transacted between 2006 and 2010.

Original and complete documentation for this data can be found on Kaggle.

Response

SalePrice

  • [int64] Sale price

Features

Order

  • [int64] Observation number

PID

  • [int64] Parcel identification number - can be used with city web site for parcel review

MS SubClass

  • [int64] Identifies the type of dwelling involved in the sale

MS Zoning

  • [object] Identifies the general zoning classification of the sale.

Lot Frontage

  • [float64] Linear feet of street connected to property

Lot Area

  • [int64] Lot size in square feet

Street

  • [object] Type of road access to property

Alley

  • [object] Type of alley access to property

Lot Shape

  • [object] General shape of property

Land Contour

  • [object] Flatness of the property

Utilities

  • [object] Type of utilities available

Lot Config

  • [object] Lot configuration

Land Slope

  • [object] Slope of property

Neighborhood

  • [object] Physical locations within Ames city limits (map available)

Condition 1

  • [object] Proximity to various conditions

Condition 2

  • [object] Proximity to various conditions (if more than one is present)

Bldg Type

  • [object] Type of dwelling

House Style

  • [object] Style of dwelling

Overall Qual

  • [int64] Rates the overall material and finish of the house

Overall Cond

  • [int64] Rates the overall condition of the house

Year Built

  • [int64] Original construction date

Year Remod/Add

  • [int64] Remodel date (same as construction date if no remodeling or additions)

Roof Style

  • [object] Type of roof

Roof Matl

  • [object] Roof material

Exterior 1st

  • [object] Exterior covering on house

Exterior 2nd

  • [object] Exterior covering on house (if more than one material)

Mas Vnr Type

  • [object] Masonry veneer type

Mas Vnr Area

  • [float64] Masonry veneer area in square feet

Exter Qual

  • [object] Evaluates the quality of the material on the exterior

Exter Cond

  • [object] Evaluates the present condition of the material on the exterior

Foundation

  • [object] Type of foundation

Bsmt Qual

  • [object] Evaluates the height of the basement

Bsmt Cond

  • [object] Evaluates the general condition of the basement

Bsmt Exposure

  • [object] Refers to walkout or garden level walls

BsmtFin Type 1

  • [object] Rating of basement finished area

BsmtFin SF 1

  • [float64] Type 1 finished square feet

BsmtFin Type 2

  • [object] Rating of basement finished area (if multiple types)

BsmtFin SF 2

  • [float64] Type 2 finished square feet

Bsmt Unf SF

  • [float64] Unfinished square feet of basement area

Total Bsmt SF

  • [float64] Total square feet of basement area

Heating

  • [object] Type of heating

Heating QC

  • [object] Heating quality and condition

Central Air

  • [object] Central air conditioning

Electrical

  • [object] Electrical system

1st Flr SF

  • [int64] First floor square feet

2nd Flr SF

  • [int64] Second floor square feet

Low Qual Fin SF

  • [int64] Low quality finished square feet (all floors)

Gr Liv Area

  • [int64] Above grade (ground) living area square feet

Bsmt Full Bath

  • [float64] Basement full bathrooms

Bsmt Half Bath

  • [float64] Basement half bathrooms

Full Bath

  • [int64] Full bathrooms above grade

Half Bath

  • [int64] Half baths above grade

Bedroom AbvGr

  • [int64] Bedrooms above grade (does not include basement bedrooms)

Kitchen AbvGr

  • [int64] Kitchens above grade

Kitchen Qual

  • [object] Kitchen quality

TotRms AbvGrd

  • [int64] Total rooms above grade (does not include bathrooms)

Functional

  • [object] Home functionality (Assume typical unless deductions are warranted)

Fireplaces

  • [int64] Number of fireplaces

Fireplace Qu

  • [object] Fireplace quality

Garage Type

  • [object] Garage location

Garage Yr Blt

  • [float64] Year garage was built

Garage Finish

  • [object] Interior finish of the garage

Garage Cars

  • [float64] Size of garage in car capacity

Garage Area

  • [float64] Size of garage in square feet

Garage Qual

  • [object] Garage quality

Garage Cond

  • [object] Garage condition

Paved Drive

  • [object] Paved driveway

Wood Deck SF

  • [int64] Wood deck area in square feet

Open Porch SF

  • [int64] Open porch area in square feet

Enclosed Porch

  • [int64] Enclosed porch area in square feet

3Ssn Porch

  • [int64] Three season porch area in square feet

Screen Porch

  • [int64] Screen porch area in square feet

Pool Area

  • [int64] Pool area in square feet

Pool QC

  • [object] Pool quality

Fence

  • [object] Fence quality

Misc Feature

  • [object] Miscellaneous feature not covered in other categories

Misc Val

  • [int64] Value of miscellaneous feature

Mo Sold

  • [int64] Month Sold

Yr Sold

  • [int64] Year Sold

Sale Type

  • [object] Type of sale

Sale Condition

  • [object] Condition of sale

Data in Python

To load the data in Python, use:

import pandas as pd
ames_train = pd.read_csv(
    "https://cs307.org/lab-04/data/ames-train.csv",
)
ames_test = pd.read_csv(
    "https://cs307.org/lab-04/data/ames-test.csv",
)

Prepare Data for Machine Learning

Create the X and y variants of the data for use with sklearn:

# create X and y for train dataset
X_train = ames_train.drop("SalePrice", axis=1)
y_train = ames_train["SalePrice"]

# create X and y for test dataset
X_test = ames_test.drop("SalePrice", axis=1)
y_test = ames_test["SalePrice"]

You can assume that within the autograder, similar processing is performed on the production data.

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
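
As a rough starting point, the sketch below shows a few pandas calls that are often useful for a first look at data like this. The small `toy` data frame and its values are hypothetical stand-ins so that the example runs on its own; in the lab you would apply the same calls to `ames_train`. The specific statistics requested on PrairieLearn may differ from these.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ames_train (made-up values, real column names).
toy = pd.DataFrame(
    {
        "Gr Liv Area": [1500, 900, 1300, 2100],
        "Lot Frontage": [100.0, 80.0, np.nan, 90.0],
        "Neighborhood": ["NAmes", "NAmes", "Gilbert", "NAmes"],
        "SalePrice": [215000, 105000, 172000, 244000],
    }
)

# Count, mean, standard deviation, and quantiles of the response.
price_summary = toy["SalePrice"].describe()

# Number of missing values in each column, largest first.
missing_counts = toy.isna().sum().sort_values(ascending=False)

# Median sale price within each level of a categorical feature.
median_by_neighborhood = toy.groupby("Neighborhood")["SalePrice"].median()

print(price_summary)
print(missing_counts)
print(median_by_neighborhood)
```

Checking missing-value counts early is worthwhile here because several features (for example, Lot Frontage) are stored as float64, which often signals missing entries.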

Models

For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:

  • Models must start from the given training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has the same variable names and types as X_train, along with a compatible shape.
    • In the autograder, we will also call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has the same variable names and types as X_train, along with a compatible shape.
    • We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation.
      • More specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
  • Your model must have a fit method.
  • Your model must have a predict method.
  • Your model should be created with scikit-learn version 1.5.2 or newer.
  • Your model should be serialized with joblib version 1.4.2 or newer.
    • Your serialized model must be less than 5MB.
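
One possible shape for a Pipeline that is fit with GridSearchCV is sketched below. This is a minimal illustration, not the intended solution: the toy data, the step names, the choice of Ridge regression, and the alpha grid are all assumptions made for the example. The point is that preprocessing for numeric and non-numeric columns lives inside the pipeline, so the fitted object accepts a raw data frame with the same columns as the training data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for X_train and y_train.
rng = np.random.default_rng(0)
n = 60
X_toy = pd.DataFrame(
    {
        "Gr Liv Area": rng.integers(800, 3000, size=n),
        "Lot Frontage": rng.normal(70, 10, size=n),
        "Neighborhood": rng.choice(["NAmes", "Gilbert", "OldTown"], size=n),
    }
)
X_toy.loc[::7, "Lot Frontage"] = np.nan  # simulate missing values
y_toy = 100 * X_toy["Gr Liv Area"] + rng.normal(0, 5000, size=n)

# Numeric columns: impute missing values with the median, then standardize.
numeric = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)

# Non-numeric columns: impute with the most frequent level, then one-hot encode.
categorical = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Route columns to the right preprocessing based on their dtype.
preprocess = ColumnTransformer(
    [
        ("num", numeric, make_column_selector(dtype_include="number")),
        ("cat", categorical, make_column_selector(dtype_exclude="number")),
    ]
)

pipe = Pipeline([("preprocess", preprocess), ("regress", Ridge())])

# Cross-validate over an illustrative grid, scoring by (negated) MAPE.
param_grid = {"regress__alpha": [0.1, 1.0, 10.0]}
mod = GridSearchCV(
    pipe, param_grid, cv=5, scoring="neg_mean_absolute_percentage_error"
)
mod.fit(X_toy, y_toy)

print(mod.best_params_)
print(mod.predict(X_toy.head()))
```

Because GridSearchCV refits the best pipeline on all of the data it was given by default, the resulting `mod` has both `fit` and `predict` methods and can be serialized as a single object.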

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.

We will use MAPE (mean absolute percentage error) to assess your submitted model.

\[ \text{MAPE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right|}{\left| y_i \right|} \]
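
To make the formula concrete, the sketch below computes MAPE by hand with NumPy and compares it to `mean_absolute_percentage_error` from `sklearn.metrics`. The prices are made-up values chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Hypothetical actual and predicted sale prices.
y_true = np.array([100_000.0, 200_000.0, 400_000.0])
y_pred = np.array([110_000.0, 180_000.0, 400_000.0])

# Direct translation of the formula: mean of |y - y_hat| / |y|.
mape_manual = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# The same quantity via scikit-learn.
mape_sklearn = mean_absolute_percentage_error(y_true, y_pred)

print(mape_manual, mape_sklearn)
```

Note that the result is a proportion rather than a percentage, so a value of 0.085 corresponds to an average error of 8.5% of the sale price.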

To obtain the maximum points via the autograder, your model's MAPE must be at or below the following thresholds (lower is better):

Test MAPE: 0.085
Production MAPE: 0.085

Model Persistence

To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects for this lab.

from joblib import dump
dump(mod, "filename.joblib")

Because the models for this lab could be quite large, consider using the compress parameter of the dump function!
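
A minimal sketch of compressed serialization follows. The tiny linear model, the temporary directory, and the compression level of 3 are assumptions made so the example runs on its own; in the lab, `mod` would be your fitted model and the filename would be the one PrairieLearn asks for.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for your fitted model.
mod = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

# Write to a temporary directory so the sketch leaves no files behind.
path = os.path.join(tempfile.mkdtemp(), "filename.joblib")

# compress accepts an integer from 0 to 9; higher values trade speed for size.
dump(mod, path, compress=3)

# Check the file size against the 5MB limit before submitting.
size_mb = os.path.getsize(path) / 1e6
print(f"serialized size: {size_mb:.3f} MB")

# Reload to confirm that the saved model still predicts.
reloaded = load(path)
print(reloaded.predict([[3.0]]))
```

Reloading the file and calling `predict` once is a cheap way to catch serialization problems before the autograder does.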

Discussion

You do not need to write a discussion for this lab! Remember, you can just submit the template notebook for full points!

If you do choose to write a discussion, then as always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real-world scenario described at the start of the lab! Justify your conclusion. If you trained multiple models that are mentioned in your report, first make clear which model you selected and are considering for use in practice.

Additional discussion prompts:

  • Are there any variables that obviously should not be included? (Yes!)
    • This could possibly be discussed in the data section.
  • Would you put this model into practice today? (We hope not!)
    • Would you put this model into practice in 2011? (Maybe!)
    • Where would it be appropriate to use this model? (Ames, Iowa!)
  • Even though we used MAPE, and even considering percentage error, does your model perform equally well for low-cost and high-cost homes?
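
One way to explore the last prompt is to compute MAPE separately within price ranges. The sketch below does this with made-up actual and predicted prices, splitting homes into three equal-sized groups by actual sale price. The number of groups and their labels are arbitrary choices for illustration; with real results you would use `y_test` and your model's predictions.

```python
import pandas as pd

# Hypothetical actual and predicted sale prices.
y_true = pd.Series(
    [80_000, 95_000, 150_000, 175_000, 320_000, 410_000], dtype=float
)
y_pred = pd.Series(
    [92_000, 104_500, 153_000, 171_500, 304_000, 430_500], dtype=float
)

# Absolute percentage error for each home.
ape = (y_true - y_pred).abs() / y_true.abs()

# Split homes into three equal-sized groups by actual sale price.
price_bin = pd.qcut(y_true, q=3, labels=["low", "mid", "high"])

# MAPE within each price group.
mape_by_bin = ape.groupby(price_bin, observed=True).mean()
print(mape_by_bin)
```

If the per-group values differ substantially, the overall MAPE is hiding uneven performance across the price range, which is worth mentioning in a discussion.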

When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!

Template Notebook

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy page.

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.