Credit Ratings

This page presents information about the Credit Ratings dataset which will be used as a part of Lab 02 in CS 307.

Rather go to bed without dinner than to rise in debt.

– Benjamin Franklin

Source

The original source of the Credit Ratings data is the textbook An Introduction to Statistical Learning. Note that this is simulated data, not real data.

The specific data used here is modified from the original source, which can be found on GitHub.

The data can also be obtained via the ISLP Python package, but we do not recommend doing so, as installing it will downgrade several packages used in this course. However, the package does include some useful documentation.

Note that the documentation does not currently appear to match the data it describes.

Modification

The data was originally designed to be used to predict the balance on credit cards, but we have repurposed the data for CS 307. As such, we have removed several variables from the original data.

Additionally, we have deliberately introduced a non-trivial amount of missing data. We have also split the data into training and test sets, and only the training data is made available. The test data is only available in the PrairieLearn autograder.
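Because missing values were added deliberately, it can help to quantify them before modeling. The following is a minimal sketch, not part of the lab itself: it builds a small frame from five rows of the training data preview shown on this page and reports missingness per column. The helper name missing_summary is hypothetical.

```python
import numpy as np
import pandas as pd

# Five rows (0, 1, 296, 298, 299) copied from the training data preview on this page.
sample = pd.DataFrame(
    {
        "Rating": [448.0, 411.0, 272.0, 747.0, 355.0],
        "Income": [49.570, 26.813, np.nan, 135.118, 22.939],
        "Age": [28.0, 55.0, 69.0, 81.0, np.nan],
        "Education": [9.0, 16.0, 8.0, 15.0, np.nan],
        "Gender": ["Female", "Female", "Male", "Female", "Female"],
        "Student": ["No", "No", "No", "No", "No"],
        "Married": ["Yes", "No", "Yes", "Yes", "Yes"],
        "Ethnicity": ["Asian", "Caucasian", "Caucasian", "Asian", "Asian"],
    }
)


def missing_summary(df):
    """Return the count and proportion of missing values for each column."""
    return pd.DataFrame(
        {
            "n_missing": df.isna().sum(),
            "prop_missing": df.isna().mean(),
        }
    )


summary = missing_summary(sample)
print(summary)
```

With the real data, the same function could be applied to the loaded training frame instead of this small sample.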

Credit Ratings

A credit rating is an evaluation of the risk associated with loaning money or extending credit to a potential debtor. In this case, we are considering credit scores of individual consumers, but credit ratings in general can apply to both individuals and organizations, and even to entire states and countries.

Data Dictionary

The Credit Ratings dataset used in CS 307 includes additional preprocessing for use in Lab 02.

We document that specific data here. Each observation contains information about a particular banking customer. We assume that these are customers operating in the United States.

Rating

  • [float64] credit rating, specifically the credit score of an individual consumer

Income

  • [float64] yearly income in $1000s

Age

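  • [float64] age in years; values are whole numbers, but pandas stores the column as float64 because it contains missing values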

Education

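  • [float64] years of education completed; values are whole numbers, but pandas stores the column as float64 because it contains missing values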

Gender

  • [object] gender

Student

  • [object] a Yes / No variable with Yes indicating an individual is a student

Married

  • [object] a Yes / No variable with Yes indicating an individual is married

Ethnicity

  • [object] ethnicity
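The types listed above are what pandas infers by default. As an optional illustration, and not something the lab requires, the sketch below shows one way to convert the whole-number columns to a nullable integer type and the Yes / No columns to a nullable boolean type. The helper name tidy_types is hypothetical, and the three rows are copied from the training data preview on this page.

```python
import numpy as np
import pandas as pd

# Rows 0, 1, and 299 from the training data preview (selected columns only).
sample = pd.DataFrame(
    {
        "Age": [28.0, 55.0, np.nan],
        "Education": [9.0, 16.0, np.nan],
        "Student": ["No", "No", "No"],
        "Married": ["Yes", "No", "Yes"],
    }
)


def tidy_types(df):
    """Return a copy with nullable integer and boolean columns."""
    out = df.copy()
    # "Int64" (capital I) is the pandas nullable integer type; NaN becomes <NA>.
    for col in ["Age", "Education"]:
        out[col] = out[col].astype("Int64")
    # Map Yes / No to True / False, keeping any missing values as <NA>.
    for col in ["Student", "Married"]:
        out[col] = out[col].map({"Yes": True, "No": False}).astype("boolean")
    return out


tidy = tidy_types(sample)
print(tidy.dtypes)
```

Whether such a conversion is useful depends on the modeling tools used afterwards, since some estimators expect plain NumPy floats rather than pandas nullable types.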

Data for Machine Learning

For the CS 307 lab, we provide the training data as a CSV file, accessible via the web.

The test data is only available within the autograder.

Loading the Data

import pandas as pd

# read the training data directly from the course website
credit_train = pd.read_csv("https://cs307.org/lab-02/data/credit-train.csv")
credit_train
     Rating   Income   Age  Education  Gender  Student  Married         Ethnicity
  0   448.0   49.570  28.0        9.0  Female       No      Yes             Asian
  1   411.0   26.813  55.0       16.0  Female       No       No         Caucasian
  2   181.0   30.406  79.0       14.0    Male       No      Yes  African American
  3   169.0   27.349  51.0       16.0  Female       No      Yes  African American
  4   292.0   12.068  44.0       18.0  Female       No      Yes             Asian
...     ...      ...   ...        ...     ...      ...      ...               ...
295   369.0   40.442  81.0        8.0  Female       No       No  African American
296   272.0      NaN  69.0        8.0    Male       No      Yes         Caucasian
297   251.0   14.132  75.0       17.0    Male       No       No         Caucasian
298   747.0  135.118  81.0       15.0  Female       No      Yes             Asian
299   355.0   22.939   NaN        NaN  Female       No      Yes             Asian

300 rows × 8 columns
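Once the data is loaded, a common next step is to separate a target from the features and to group the features by type. The sketch below assumes that Rating is the target; this page does not state that explicitly, so treat it as an assumption. It uses five rows copied from the preview above rather than downloading the file, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Five rows (0, 1, 296, 298, 299) copied from the training data preview.
sample = pd.DataFrame(
    {
        "Rating": [448.0, 411.0, 272.0, 747.0, 355.0],
        "Income": [49.570, 26.813, np.nan, 135.118, 22.939],
        "Age": [28.0, 55.0, 69.0, 81.0, np.nan],
        "Education": [9.0, 16.0, 8.0, 15.0, np.nan],
        "Gender": ["Female", "Female", "Male", "Female", "Female"],
        "Student": ["No", "No", "No", "No", "No"],
        "Married": ["Yes", "No", "Yes", "Yes", "Yes"],
        "Ethnicity": ["Asian", "Caucasian", "Caucasian", "Asian", "Asian"],
    }
)

TARGET = "Rating"  # assumption: Rating is the variable to be predicted

# Separate the features (X) from the target (y).
X = sample.drop(columns=[TARGET])
y = sample[TARGET]

# Group feature names by type; these groups often receive different preprocessing.
numeric_features = X.select_dtypes(include="number").columns.tolist()
categorical_features = X.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)
print(categorical_features)
```

With the real data, sample would be replaced by credit_train. The numeric and categorical groups typically need different handling of missing values, which is why they are identified separately here.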