import pandas as pd
Lab 04: Ames Home Prices
For Lab 04, you will use housing data to develop a model to predict the selling price of homes in Ames, Iowa.
Background
People live in homes! The price of these homes is extremely important. For many buyers, a home will be the largest purchase of their life, often requiring a mortgage with a repayment schedule spread over 30 years.
Scenario and Goal
You work for an online real estate listing aggregator, perhaps similar to a well known company whose name rhymes with Pillow. You are tasked with predicting the sale value of homes given features of the home such as size, number of bathrooms, etc. Users of the website can use your predictions to evaluate actual list prices if they are buying. If they are selling, they can use your predictions when considering at what price to list their home.
Data
To achieve the goal of this lab, we will need housing data from Ames, Iowa. The necessary data is provided in the following files:
Source
The data (and task) for this lab originally comes from Kaggle.
You should not use that data, but instead the data provided in this lab.
Data Dictionary
Each observation in the train, test, and (hidden) production data contains information about a particular home in Ames, Iowa that transacted between 2006 and 2010.
Original and complete documentation for this data can be found on Kaggle.
Response
SalePrice
[int64]
Sale price
Features
Order
[int64]
Observation number
PID
[int64]
Parcel identification number - can be used with city web site for parcel review
MS SubClass
[int64]
Identifies the type of dwelling involved in the sale
MS Zoning
[object]
Identifies the general zoning classification of the sale.
Lot Frontage
[float64]
Linear feet of street connected to property
Lot Area
[int64]
Lot size in square feet
Street
[object]
Type of road access to property
Alley
[object]
Type of alley access to property
Lot Shape
[object]
General shape of property
Land Contour
[object]
Flatness of the property
Utilities
[object]
Type of utilities available
Lot Config
[object]
Lot configuration
Land Slope
[object]
Slope of property
Neighborhood
[object]
Physical locations within Ames city limits (map available)
Condition 1
[object]
Proximity to various conditions
Condition 2
[object]
Proximity to various conditions (if more than one is present)
Bldg Type
[object]
Type of dwelling
House Style
[object]
Style of dwelling
Overall Qual
[int64]
Rates the overall material and finish of the house
Overall Cond
[int64]
Rates the overall condition of the house
Year Built
[int64]
Original construction date
Year Remod/Add
[int64]
Remodel date (same as construction date if no remodeling or additions)
Roof Style
[object]
Type of roof
Roof Matl
[object]
Roof material
Exterior 1st
[object]
Exterior covering on house
Exterior 2nd
[object]
Exterior covering on house (if more than one material)
Mas Vnr Type
[object]
Masonry veneer type
Mas Vnr Area
[float64]
Masonry veneer area in square feet
Exter Qual
[object]
Evaluates the quality of the material on the exterior
Exter Cond
[object]
Evaluates the present condition of the material on the exterior
Foundation
[object]
Type of foundation
Bsmt Qual
[object]
Evaluates the height of the basement
Bsmt Cond
[object]
Evaluates the general condition of the basement
Bsmt Exposure
[object]
Refers to walkout or garden level walls
BsmtFin Type 1
[object]
Rating of basement finished area
BsmtFin SF 1
[float64]
Type 1 finished square feet
BsmtFin Type 2
[object]
Rating of basement finished area (if multiple types)
BsmtFin SF 2
[float64]
Type 2 finished square feet
Bsmt Unf SF
[float64]
Unfinished square feet of basement area
Total Bsmt SF
[float64]
Total square feet of basement area
Heating
[object]
Type of heating
Heating QC
[object]
Heating quality and condition
Central Air
[object]
Central air conditioning
Electrical
[object]
Electrical system
1st Flr SF
[int64]
First Floor square feet
2nd Flr SF
[int64]
Second floor square feet
Low Qual Fin SF
[int64]
Low quality finished square feet (all floors)
Gr Liv Area
[int64]
Above grade (ground) living area square feet
Bsmt Full Bath
[float64]
Basement full bathrooms
Bsmt Half Bath
[float64]
Basement half bathrooms
Full Bath
[int64]
Full bathrooms above grade
Half Bath
[int64]
Half baths above grade
Bedroom AbvGr
[int64]
Bedrooms above grade (does not include basement bedrooms)
Kitchen AbvGr
[int64]
Kitchens above grade
Kitchen Qual
[object]
Kitchen quality
TotRms AbvGrd
[int64]
Total rooms above grade (does not include bathrooms)
Functional
[object]
Home functionality (Assume typical unless deductions are warranted)
Fireplaces
[int64]
Number of fireplaces
Fireplace Qu
[object]
Fireplace quality
Garage Type
[object]
Garage location
Garage Yr Blt
[float64]
Year garage was built
Garage Finish
[object]
Interior finish of the garage
Garage Cars
[float64]
Size of garage in car capacity
Garage Area
[float64]
Size of garage in square feet
Garage Qual
[object]
Garage quality
Garage Cond
[object]
Garage condition
Paved Drive
[object]
Paved driveway
Wood Deck SF
[int64]
Wood deck area in square feet
Open Porch SF
[int64]
Open porch area in square feet
Enclosed Porch
[int64]
Enclosed porch area in square feet
3Ssn Porch
[int64]
Three season porch area in square feet
Screen Porch
[int64]
Screen porch area in square feet
Pool Area
[int64]
Pool area in square feet
Pool QC
[object]
Pool quality
Fence
[object]
Fence quality
Misc Feature
[object]
Miscellaneous feature not covered in other categories
Misc Val
[int64]
Value of miscellaneous feature
Mo Sold
[int64]
Month Sold
Yr Sold
[int64]
Year Sold
Sale Type
[object]
Type of sale
Sale Condition
[object]
Condition of sale
Data in Python
To load the data in Python, use:
import pandas as pd

# load the train and test data directly from the course website
ames_train = pd.read_csv(
    "https://cs307.org/lab-04/data/ames-train.csv",
)
ames_test = pd.read_csv(
    "https://cs307.org/lab-04/data/ames-test.csv",
)
Prepare Data for Machine Learning
Create the X and y variants of the data for use with sklearn:
# create X and y for train dataset
X_train = ames_train.drop("SalePrice", axis=1)
y_train = ames_train["SalePrice"]

# create X and y for test dataset
X_test = ames_test.drop("SalePrice", axis=1)
y_test = ames_test["SalePrice"]
You can assume that within the autograder, similar processing is performed on the production data.
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
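The specific statistics to report are listed on PrairieLearn, so the following is only a minimal sketch of the kind of pandas calls that tend to be useful here. It runs on a tiny made-up frame named toy (a stand-in for ames_train, with invented values and placeholder neighborhood labels) so that it works without downloading anything.

```python
import pandas as pd

# Toy stand-in for ames_train: invented values, NOT the lab data.
toy = pd.DataFrame(
    {
        "SalePrice": [100000, 150000, 200000, 250000],
        "Lot Frontage": [60.0, None, 80.0, 75.0],
        "Neighborhood": ["A", "A", "B", "C"],  # placeholder labels
    }
)

# Center, spread, and range of the response.
price_summary = toy["SalePrice"].agg(["mean", "std", "median", "min", "max"])

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Median sale price within each neighborhood.
median_by_hood = toy.groupby("Neighborhood")["SalePrice"].median()

print(price_summary)
print(missing_share)
print(median_by_hood)
```

Replacing toy with ames_train should give the corresponding quantities for the real training data.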
Models
For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:
- Models must start from the given training data, unmodified.
  - Importantly, the types and shapes of X_train and y_train should not be changed.
  - In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a compatible shape with and the same variable names and types as X_train.
  - In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a compatible shape with and the same variable names and types as X_train.
  - We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation.
    - So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
- Your model must have a fit method.
- Your model must have a predict method.
- Your model should be created with scikit-learn version 1.5.2 or newer.
- Your model should be serialized with joblib version 1.4.2 or newer.
  - Your serialized model must be less than 5MB.
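One way these rules could fit together is sketched below. This is an illustration under stated assumptions, not the expected solution: the data is a small synthetic stand-in (X_toy, y_toy) for X_train and y_train, Ridge regression and its alpha grid are arbitrary choices, and the size check assumes "5MB" means 5 million bytes. The preprocessing lives inside the Pipeline, so the fitted GridSearchCV object accepts raw data with missing values and mixed types.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from joblib import dump
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for X_train / y_train: invented values, NOT the lab data.
rng = np.random.default_rng(0)
n = 60
alley = rng.choice(["Grvl", "Pave"], size=n).astype(object)
alley[rng.random(n) < 0.5] = np.nan  # missing entries, as read_csv would give
X_toy = pd.DataFrame(
    {
        "Gr Liv Area": rng.integers(800, 3000, size=n),
        "Lot Frontage": rng.choice([60.0, 80.0, np.nan], size=n),
        "Central Air": rng.choice(["Y", "N"], size=n),
        "Alley": alley,
    }
)
y_toy = 100.0 * X_toy["Gr Liv Area"] + rng.normal(0.0, 5000.0, size=n)

# Numeric columns: impute with the median, then standardize.
numeric = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)
# Non-numeric columns: label missing values explicitly, then one-hot encode.
categorical = Pipeline(
    [
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocess = ColumnTransformer(
    [
        ("num", numeric, make_column_selector(dtype_include="number")),
        ("cat", categorical, make_column_selector(dtype_exclude="number")),
    ]
)
pipe = Pipeline([("preprocess", preprocess), ("regressor", Ridge())])

# Cross-validated search over an (arbitrary) grid, scored with negative MAPE.
mod = GridSearchCV(
    pipe,
    param_grid={"regressor__alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_absolute_percentage_error",
)
mod.fit(X_toy, y_toy)
preds = mod.predict(X_toy)  # same call the autograder makes on X_test

# Serialize to a temporary file and measure its size in MB (10**6 bytes).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")  # hypothetical filename
    dump(mod, path)
    size_mb = os.path.getsize(path) / 1e6

print(mod.best_params_, round(size_mb, 3))
```

Because GridSearchCV refits the best pipeline on all of the data by default, mod itself has both fit and predict methods and can be serialized directly.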
We will use MAPE (mean absolute percentage error) to assess your submitted model.
\[ \text{MAPE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right|}{\left| y_i \right|} \]
To obtain the maximum points via the autograder, your model's MAPE must be at or below the following values (lower is better):
Test MAPE: 0.085
Production MAPE: 0.085
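As a small worked example of the MAPE formula, the sketch below computes it by hand with NumPy on three made-up prices. The function name mape and the numbers are illustrative only; scikit-learn also provides sklearn.metrics.mean_absolute_percentage_error, which should agree for nonzero targets.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.05 means 5%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)))


# Made-up sale prices and predictions.
y_true = [100000, 200000, 400000]
y_pred = [110000, 190000, 400000]

# Per-home errors are 0.10, 0.05, and 0.00, so the mean is 0.05.
score = mape(y_true, y_pred)
print(score)
```

A score of 0.05 would be below the 0.085 threshold, since it corresponds to being off by 5% of the sale price on average.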
Model Persistence
To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects for this lab.
from joblib import dump

dump(mod, "filename.joblib")
Discussion
As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab! Justify your conclusion. If you trained multiple models that are mentioned in your report, first make clear which model you selected and are considering for use in practice.
Additional discussion prompts:
- Are there any variables that obviously should not be included? (Yes!)
- This could possibly be discussed in the data section.
- Would you put this model into practice today? (We hope not!)
- Would you put this model into practice in 2011? (Maybe!)
- Where would it be appropriate to use this model? (Ames, Iowa!)
- Even though we used MAPE, and even considering percentage error, does your model perform equally well for low-cost and high-cost homes?
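For the last prompt, one possible approach is to compute MAPE separately within price bands. The sketch below does so on six made-up homes, splitting them into three equal-sized bands by true sale price. The values, the number of bands, and the band labels are all assumptions for illustration; with the real data, y_test and mod.predict(X_test) would take the place of y_true and y_pred.

```python
import pandas as pd

# Made-up true prices and predictions, NOT results from any real model.
y_true = pd.Series([80000, 120000, 160000, 200000, 300000, 500000], dtype=float)
y_pred = pd.Series([100000, 130000, 150000, 210000, 290000, 480000], dtype=float)

# Absolute percentage error for each home.
ape = (y_true - y_pred).abs() / y_true.abs()

# Three equal-sized bands by true sale price.
band = pd.qcut(y_true, q=3, labels=["low", "mid", "high"])

# MAPE within each band.
mape_by_band = ape.groupby(band, observed=True).mean()
print(mape_by_band)
```

In this toy example the low band has the largest percentage error, which is the kind of pattern worth checking for and discussing in the report.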
When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!
Template Notebook
Submission
On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.