Lab 06: Credit Card Fraud

Scenario: You work for a bank that issues credit cards to its customers. You are tasked with automatically detecting fraudulent transactions.

Goal

The goal of this lab is to create a model that will act as a fraud detector as part of an automated banking system. It should predict whether each credit card transaction is fraudulent or genuine.

Data

The data for this lab originally comes from Kaggle, where citations for the data can be found.

A brief description of the target variable is given.

This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.

Similarly, a brief description of the feature variables is given.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are ‘Time’ and ‘Amount’. Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction Amount, this feature can be used for example-dependant cost-sensitive learning. Feature ‘Class’ is the response variable and it takes value 1 in case of fraud and 0 otherwise.

We are providing a modified version of this data for this lab.

Modifications include:

  • Removed the Time variable as it is misleading.
  • Reduced the number of samples, while maintaining the number of fraudulent transactions.
    • The class imbalance is reduced, but the target is still highly imbalanced.
  • Withheld some data that will be considered the production data.
  • Renamed the target variable from Class to Fraud.
  • Renamed the PCA transformed variables.

Principal Component Analysis (PCA) is a method that we will learn about later in the course. For now, know that it takes some number of features as input and outputs the same number of features or fewer, while retaining most of the information in the original features. You can assume things like location and type of purchase were among the original input features. (Ever had a credit card transaction denied while traveling?)
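For illustration only, here is a minimal sketch of PCA in scikit-learn on made-up toy data (not the lab data), showing how correlated features can be compressed into fewer components:

import numpy as np
from sklearn.decomposition import PCA

# toy data: 100 samples with 5 correlated features (hypothetical)
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

# keep enough principal components to retain 90% of the variance
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # fewer features out than in
print(pca.explained_variance_ratio_)   # information retained per component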

Response

Fraud

  • [int64] status of the transaction. 1 indicates a fraudulent transaction and 0 indicates a genuine (non-fraudulent) transaction.

Features

Amount

  • [float64] amount (in dollars) of the transaction.

PC01 - PC28

  • [float64] the 28 principal components that encode information such as location and type of purchase while preserving customer privacy.

Data in Python

To load the data in Python, use:

import pandas as pd
fraud = pd.read_csv("https://cs307.org/lab-06/data/fraud.csv")

Prepare Data for Machine Learning

First, train-test split the data by using:

from sklearn.model_selection import train_test_split
fraud_train, fraud_test = train_test_split(
    fraud,
    test_size=0.20,
    random_state=42,
    stratify=fraud["Fraud"],
)

Then, to create the X and y variants of the data, use:

# create X and y for train
X_train = fraud_train.drop("Fraud", axis=1)
y_train = fraud_train["Fraud"]

# create X and y for test
X_test = fraud_test.drop("Fraud", axis=1)
y_test = fraud_test["Fraud"]

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
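As one possible starting point (a sketch only, not the specific statistics requested on PrairieLearn), you might inspect the class balance and the distribution of transaction amounts:

import matplotlib.pyplot as plt

# class balance: the target is highly imbalanced
print(y_train.value_counts(normalize=True))

# transaction amounts, split by fraud status
print(fraud_train.groupby("Fraud")["Amount"].describe())

# one possible visualization: amounts for genuine versus fraudulent transactions
fraud_train.boxplot(column="Amount", by="Fraud")
plt.show()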

Models

For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:

  • Models must start from the given training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with, and the same variable names and types as, X_train.
    • We assume that you will use a Pipeline and GridSearchCV from sklearn, as you should be using cross-validation.
      • So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder. (A sketch is given below.)
  • Your model must have a fit method.
  • Your model must have a predict method that returns numbers.
  • Your model must have a predict_proba method that returns numbers.
  • Your model should be created with scikit-learn version 1.4.0 or newer.
  • Your model should be serialized with joblib version 1.3.2 or newer.
    • Your serialized model must be less than 5MB.

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.
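As a concrete starting point, here is a minimal sketch of a Pipeline fit with GridSearchCV. The logistic regression, the class_weight="balanced" setting (one way to address the class imbalance), and the parameter grid are illustrative choices, not requirements:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# scale the features, then fit a logistic regression
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# tune the regularization strength with cross-validation
grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
mod = GridSearchCV(pipe, grid, cv=5, scoring="recall")
mod.fit(X_train, y_train)

Done this way, mod has fit, predict, and predict_proba methods, satisfying the rules above.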

To obtain the maximum points via the autograder, your model performance must meet or exceed:

  • Test Precision: 0.70
  • Test Recall: 0.83
  • Production Precision: 0.70
  • Production Recall: 0.83
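One way to check the test metrics, a minimal sketch assuming mod is your fitted model from above:

from sklearn.metrics import precision_score, recall_score

# predictions on the held-out test data
y_pred = mod.predict(X_test)

print("Test Precision:", precision_score(y_test, y_pred))
print("Test Recall:   ", recall_score(y_test, y_pred))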

The production data mimics data that would be passed through your model after you have put it into production, that is, once it is being used for the stated goal within the scenario of the lab. As such, you do not have access to it. You do, however, have access to the test data.

For this lab, be aware that the class imbalance means that a single test set gives a highly variable evaluation of model performance. As such, you may want to consider using the test set as more of a “sanity check” and rely more on cross-validated metrics as an estimation of future performance.
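For example, you could estimate precision and recall via cross-validation on the training data (a sketch assuming pipe is the unfitted pipeline from the earlier sketch):

from sklearn.model_selection import cross_validate

# cross-validated precision and recall on the training data
cv_results = cross_validate(
    pipe,
    X_train,
    y_train,
    cv=5,
    scoring=["precision", "recall"],
)
print("CV Precision:", cv_results["test_precision"].mean())
print("CV Recall:   ", cv_results["test_recall"].mean())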

Model Persistence

To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects.
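A minimal sketch, assuming mod is your fitted model; the filename below is a placeholder, so substitute the name given on PrairieLearn:

from joblib import dump

# "model.joblib" is a placeholder; check PrairieLearn for the expected filename
dump(mod, "model.joblib")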

Discussion

As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real-world scenario described at the start of the lab! If you are asked to train multiple models, first make clear which model you selected and are considering for use in practice. Discuss any limitations or potential improvements.

Additional discussion topics:

  • Which is more important, precision or recall? (The performance cutoffs above suggest an answer to this question, but you do not necessarily need to agree with it.)

When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!

Template Notebook

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy page!

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.