Lab 02: Credit Ratings

For Lab 02, you will use credit data to develop a model that will predict a banking customer’s credit score.

Rather go to bed without dinner than to rise in debt.

– Benjamin Franklin

Background

A credit rating is an evaluation of the risk associated with loaning money or extending credit to a potential debtor. In this case, we are considering credit scores of individual consumers, but credit ratings in general can apply to both individuals and organizations, even entire states and countries.

Scenario and Goal

Suppose you work as an analyst for a small local bank, perhaps a credit union, that has several loan offerings. For years, the bank has relied on credit agencies to provide ratings of its customers’ credit, but this costs the bank money. One day, you realize that it might be possible to reverse engineer your customers’ (and thus potential customers’) credit ratings based on the credit ratings that you have already purchased, as well as the income and demographic information that you already have, such as age and education level. Your goal is to create a regression model that predicts the credit rating for an individual based on income and demographic information.

Data

To achieve the goal of this lab, we will need historical credit data. The necessary data is provided in the following files:

credit-train.csv
credit-test.csv

Source

The original source of the credit data is the textbook An Introduction to Statistical Learning. Note that this is simulated data, not real data.

The specific data used here is modified from the original source, which can be found on GitHub:

GitHub: ISLP Credit Data

The data can also be obtained via the ISLP Python package, but we do not recommend doing so, as installing it will downgrade the versions of several packages used in this course. However, the package does include some useful documentation.

ISLP Documentation: Credit Card Balance Data

Note that the documentation does not currently seem to match the existing data.

The data was originally designed to be used to predict the balance on credit cards, but we have repurposed the data for CS 307. As such, we have removed several variables from the original data. Additionally, we have poisoned the data with a non-trivial amount of missing data.

Data Dictionary

Each observation in the train, test, and (hidden) production data contains information about a particular banking customer. For the purposes of this lab, we assume that the bank operates in the United States and that its customers are located there.

Response

Rating

[float64] credit rating, specifically the credit score of an individual consumer

Features

Income

[float64] yearly income in $1000s

Age

[float64] age

Education

[float64] years of education completed

Gender

[object] gender

Student

[object] a Yes / No variable with Yes indicating an individual is a student

Married

[object] a Yes / No variable with Yes indicating an individual is married

Ethnicity

[object] ethnicity

Data in Python

To load the data in Python, use:

import pandas as pd

credit_train = pd.read_csv("https://cs307.org/lab-02/data/credit-train.csv")
credit_test = pd.read_csv("https://cs307.org/lab-02/data/credit-test.csv")

credit_train

	Rating	Income	Age	Education	Gender	Student	Married	Ethnicity
0	257.0	44.473	81.0	16.0	Female	No	No	NaN
1	353.0	41.532	50.0	NaN	Male	No	Yes	Caucasian
2	388.0	16.479	26.0	16.0	Male	NaN	No	NaN
3	321.0	10.793	29.0	13.0	Male	No	No	Caucasian
4	367.0	76.273	65.0	14.0	Female	No	Yes	Caucasian
...	...	...	...	...	...	...	...	...
251	268.0	26.370	78.0	11.0	Male	No	Yes	Asian
252	433.0	26.427	50.0	15.0	Female	Yes	Yes	Asian
253	259.0	12.031	58.0	18.0	Female	NaN	Yes	Caucasian
254	335.0	80.861	29.0	15.0	Female	No	Yes	Asian
255	93.0	15.717	38.0	16.0	Male	Yes	Yes	Caucasian

256 rows × 8 columns

Prepare Data for Machine Learning

Create the X and y variants of the data for use with sklearn:

# create X and y for train
X_train = credit_train.drop("Rating", axis=1)
y_train = credit_train["Rating"]

# create X and y for test
X_test = credit_test.drop("Rating", axis=1)
y_test = credit_test["Rating"]

You can assume that within the autograder, similar processing is performed on the production data.

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
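
As one possible starting point, the sketch below counts missing values per column and computes basic summary statistics of the response. To stay runnable without network access, it uses a small hypothetical data frame, credit_demo, whose values are made up for illustration; with the real data you would apply the same calls to credit_train. The statistics actually requested on PrairieLearn may differ.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for credit_train (values are made up for illustration)
credit_demo = pd.DataFrame(
    {
        "Rating": [257.0, 353.0, 388.0, 321.0],
        "Income": [44.473, 41.532, np.nan, 10.793],
        "Age": [81.0, 50.0, 26.0, 29.0],
        "Student": ["No", "No", np.nan, "No"],
    }
)

# count and proportion of missing values in each column
missing_count = credit_demo.isna().sum()
missing_prop = credit_demo.isna().mean()

# summary statistics of the response
rating_mean = credit_demo["Rating"].mean()
rating_std = credit_demo["Rating"].std()

print(missing_count)
print(missing_prop)
print(rating_mean, rating_std)
```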

Models

For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:

Models must start from the given training data, unmodified.
- Importantly, the types and shapes of X_train and y_train should not be changed.
- In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with X_train and the same variable names and types.
- In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a shape compatible with X_train and the same variable names and types.
- We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation.
  - So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
Your model must have a fit method.
Your model must have a predict method.
Your model should be created with scikit-learn version 1.5.1 or newer.
Your model should be serialized with joblib version 1.4.2 or newer.
- Your serialized model must be less than 5MB.
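
The Pipeline and GridSearchCV approach described in the rules above might be sketched as follows. This is a minimal illustration under several assumptions, not a reference solution: X_demo and y_demo are hypothetical stand-ins with made-up values and only a subset of the lab's columns, and the choice of Ridge, the imputation strategies, and the parameter grid are arbitrary examples rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical stand-in for X_train and y_train (made-up values)
rng = np.random.default_rng(42)
n = 40
X_demo = pd.DataFrame(
    {
        "Income": rng.uniform(10, 100, n),
        "Age": rng.integers(20, 80, n).astype(float),
        "Student": rng.choice(["Yes", "No"], n).astype(object),
        "Married": rng.choice(["Yes", "No"], n).astype(object),
    }
)
y_demo = 200 + 3 * X_demo["Income"] + rng.normal(0, 10, n)

# inject some missing values, mimicking the lab data
X_demo.loc[0, "Income"] = np.nan
X_demo.loc[1, "Student"] = np.nan

numeric_features = ["Income", "Age"]
categorical_features = ["Student", "Married"]

# numeric columns: impute, then scale
numeric_transformer = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]
)

# categorical columns: impute, then one-hot encode
categorical_transformer = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# apply each transformer to its own set of columns
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

pipeline = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("regressor", Ridge()),
    ]
)

# tune preprocessing and model parameters together with cross-validation
param_grid = {
    "preprocess__num__impute__strategy": ["mean", "median"],
    "regressor__alpha": [0.1, 1.0, 10.0],
}
mod = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_root_mean_squared_error")
mod.fit(X_demo, y_demo)

print(mod.best_params_)
preds = mod.predict(X_demo)
```

Because all preprocessing lives inside the pipeline, the fitted GridSearchCV object accepts data frames with missing values and mixed types directly, which is what a call like mod.predict(X_test) requires.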

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.

To obtain the maximum points via the autograder, your model’s RMSE must be at or below:

Test RMSE: 110.0
Production RMSE: 110.0
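
RMSE is the square root of the mean squared difference between actual and predicted values, so lower values indicate better performance. The sketch below computes it with NumPy on made-up numbers; the helper name rmse and the example values are assumptions for illustration only.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# made-up ratings and predictions, for illustration only
y_actual = [300.0, 400.0, 500.0]
y_predicted = [310.0, 390.0, 500.0]

demo_rmse = rmse(y_actual, y_predicted)
print(demo_rmse)
```

With the objects defined earlier in this lab, rmse(y_test, mod.predict(X_test)) would give a test RMSE to compare against the threshold.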

Model Persistence

To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects for this lab.

from joblib import dump
dump(mod, "filename.joblib")
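
Before submitting, it may be worth checking the size of the serialized file and confirming that it loads and predicts. The sketch below does this with a hypothetical stand-in model, mod_demo, written to a temporary directory; it assumes the 5MB limit means 5 × 10^6 bytes, which the autograder may define differently.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.dummy import DummyRegressor

# hypothetical stand-in for the fitted model `mod`
mod_demo = DummyRegressor(strategy="mean").fit([[0.0], [1.0]], [1.0, 3.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "filename.joblib")
    dump(mod_demo, path)

    # size of the serialized model in megabytes (assuming 1 MB = 10^6 bytes)
    size_mb = os.path.getsize(path) / 1e6

    # reload to confirm the file round-trips and can still predict
    reloaded = load(path)
    reloaded_pred = reloaded.predict([[2.0]])

print(size_mb, reloaded_pred)
```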

Discussion

As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real-world scenario described at the start of the lab! Justify your conclusion. If your report mentions multiple models that you trained, first make clear which model you selected and are considering for use in practice.

Additional discussion prompts:

Ignoring model performance, does it seem appropriate to use these features for the stated goal of these models?
- Is it legal to do so? Is it ethical to do so?
- A very important lesson to learn now that you will have the power of machine learning: Just because you can, does not mean you should.

When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!

Template Notebook

Lab 02: Template Notebook

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy page.

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.