import pandas as pd
Lab 04: Credit
For Lab 04, you will use credit data to develop a model that will predict a banking customer’s credit score.
Background
A credit rating is an evaluation of the risk associated with loaning money or extending credit to a potential debtor. In this case, we are considering credit scores (ratings) of individual consumers, but credit ratings in general can apply to both individuals and organizations, even entire states and countries.
Scenario and Goal
Who are you?
- You work as an analyst for a small local bank, perhaps a credit union, that has several loan offerings.
What is your task?
- The bank currently relies on credit agencies to rate a potential customer’s credit; however, this costs your bank money. You realize that it might be possible to reverse engineer your customers’ (and thus potential customers’) credit ratings based on the credit ratings that you have already purchased, as well as the income and demographic information that you know, such as age and education level. (Customers provide this information when applying for a loan.) You decide to develop a model that predicts the credit rating for an individual based on income and demographic information.
Who are you writing for?
- To summarize your work, you will write a report for the bank’s board of directors. They are experts in banking, but not necessarily in data science.
Data
To achieve the goal of this lab, we will need historical credit data. The necessary data is provided in the following files:
- credit-train.parquet
- credit-test.parquet
Source
The original source of the credit data is the textbook An Introduction to Statistical Learning. Note that this is simulated data, not real data.
The specific data used here is modified from the original source, which can be found on GitHub:
The data can also be obtained via the ISLP Python package, but we do not recommend doing so, as installing it will downgrade the version of several packages used in this course. However, the package does include some useful documentation.
The data was originally designed to be used to predict the balance on credit cards, but we have repurposed the data for CS 307. As such, we have removed several variables from the original data. Additionally, we have poisoned the data with a non-trivial amount of missing data.
Data Dictionary
Each observation in the train, test, and (hidden) production data contains information about a particular banking customer. For the purposes of this lab, we assume that the bank operates in the United States and that its customers are located there.
Response
- Rating [float64]: credit rating, specifically the credit score of an individual consumer
Features
- Income [float64]: yearly income in $1000s
- Age [float64]: age
- Education [float64]: years of education completed
- Gender [object]: gender
- Student [object]: a Yes/No variable with Yes indicating an individual is a student
- Married [object]: a Yes/No variable with Yes indicating an individual is married
- Ethnicity [object]: ethnicity
Data in Python
To load the data in Python, use:
credit_train = pd.read_parquet("https://cs307.org/lab/data/credit-train.parquet")
credit_test = pd.read_parquet("https://cs307.org/lab/data/credit-test.parquet")
Prepare Data for Machine Learning
Create the X and y variants of the data for use with sklearn:
# create X and y for train
X_train = credit_train.drop("Rating", axis=1)
y_train = credit_train["Rating"]
# create X and y for test
X_test = credit_test.drop("Rating", axis=1)
y_test = credit_test["Rating"]
You can assume that within the autograder, similar processing is performed on the production data.
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn.
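As a rough illustration of what looking at the data might involve, the sketch below computes a few common summaries with pandas. The specific statistics requested on PrairieLearn are not listed here, so these are only examples. The frame `toy_train` is a hypothetical stand-in with made-up values that uses the column names from the data dictionary; in practice you would run the same calls on `credit_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for credit_train: same column names as the data
# dictionary, but made-up values (the real data comes from the parquet files).
toy_train = pd.DataFrame(
    {
        "Rating": [283.0, 483.0, 514.0, 681.0, 357.0, 569.0],
        "Income": [14.9, 106.0, np.nan, 148.9, 55.9, 80.2],
        "Age": [34.0, 82.0, 71.0, np.nan, 68.0, 77.0],
        "Education": [11.0, 15.0, 11.0, 11.0, 16.0, np.nan],
        "Gender": ["Male", "Female", "Male", "Female", None, "Male"],
        "Student": ["No", "Yes", "No", "No", "No", "No"],
        "Married": ["Yes", "Yes", "No", "No", "Yes", None],
        "Ethnicity": ["Caucasian", "Asian", "Asian", None, "Caucasian", "Caucasian"],
    }
)

# Summary statistics for the numeric columns (count, mean, std, quartiles).
numeric_summary = toy_train.describe()

# Proportion of missing values in each column.
missing_rate = toy_train.isna().mean()

# Mean of the response within each level of a categorical feature.
rating_by_student = toy_train.groupby("Student")["Rating"].mean()

print(numeric_summary)
print(missing_rate)
print(rating_by_student)
```

Checking the missing-value proportions early is useful because the data was deliberately given a non-trivial amount of missing data, which any model will need to handle.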
Models
For this lab, you will select one model to submit to the autograder. You may use any modeling techniques you’d like, so long as your model meets these requirements:
- Your model must start from the given training data, unmodified.
  - Importantly, the types and shapes of X_train and y_train should not be changed.
  - In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a compatible shape with and the same variable names and types as X_train.
  - In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a compatible shape with and the same variable names and types as X_train.
  - We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation to tune your model.
    - More specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a tuned model that you can submit to the autograder.
- Your model must have a fit method.
- Your model must have a predict method.
- Your model should be created with scikit-learn version 1.6.1 or newer.
- Your model should be serialized with joblib version 1.4.2 or newer.
- Your serialized model must be less than 5MB.
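One way the requirements above might fit together is sketched below: a Pipeline that imputes and encodes the heterogeneous columns, tuned with GridSearchCV, then serialized with joblib and checked against the size limit. This is a minimal sketch under several assumptions, not the expected solution. The data is synthetic (made-up values using the column names from the data dictionary), the choice of `Ridge` and of the parameter grid is purely illustrative, and the file name `credit.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for X_train and y_train (all values are made up).
rng = np.random.default_rng(307)
n = 200
X_toy = pd.DataFrame(
    {
        "Income": rng.gamma(2.0, 25.0, n),
        "Age": rng.integers(20, 90, n).astype(float),
        "Education": rng.integers(8, 20, n).astype(float),
        "Gender": rng.choice(["Male", "Female"], n),
        "Student": rng.choice(["Yes", "No"], n),
        "Married": rng.choice(["Yes", "No"], n),
        "Ethnicity": rng.choice(["Asian", "Caucasian", "African American"], n),
    }
)
y_toy = 200 + 3 * X_toy["Income"] + rng.normal(0, 30, n)

# Knock out some feature values to mimic the missing data in the lab.
X_toy.loc[rng.choice(n, 20, replace=False), "Income"] = np.nan
X_toy.loc[rng.choice(n, 20, replace=False), "Ethnicity"] = np.nan

numeric_features = ["Income", "Age", "Education"]
categorical_features = ["Gender", "Student", "Married", "Ethnicity"]

# Numeric columns: impute missing values, then standardize.
numeric_transformer = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]
)
# Categorical columns: impute with the most frequent level, then one-hot encode.
categorical_transformer = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
pipeline = Pipeline(
    steps=[("preprocess", preprocessor), ("regressor", Ridge())]
)

# Illustrative grid: tune one preprocessing choice and one model parameter.
param_grid = {
    "preprocess__num__impute__strategy": ["mean", "median"],
    "regressor__alpha": [0.1, 1.0, 10.0],
}
mod = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
mod.fit(X_toy, y_toy)

# Serialize the fitted search object, check its size, and reload it.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "credit.joblib")  # hypothetical name
    joblib.dump(mod, model_path)
    size_mb = os.path.getsize(model_path) / 1e6
    reloaded = joblib.load(model_path)

print(mod.best_params_)
print(f"serialized size: {size_mb:.3f} MB")
```

Because all preprocessing lives inside the pipeline, the fitted `GridSearchCV` object accepts a data frame with the same columns and types as the training features, which is what calling `mod.predict` in the autograder relies on.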
To obtain the maximum points via the autograder, your model must outperform the following metrics:
Test RMSE: 110.0
Production RMSE: 110.0
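For reference, RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual ratings, so lower values are better. The sketch below computes it on made-up numbers; `y_true` and `y_pred` are hypothetical placeholders for `y_test` and `mod.predict(X_test)`, and the comparison assumes that "outperform" means an RMSE below the stated threshold.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true ratings and model predictions (made-up numbers).
y_true = np.array([300.0, 450.0, 520.0, 610.0])
y_pred = np.array([320.0, 430.0, 500.0, 650.0])

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"RMSE: {rmse:.2f}")
print("below threshold:", rmse < 110.0)
```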
Submission
On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.