Lab 07: Credit Fraud
For Lab 07, you will use credit card transaction data to develop a model that will detect fraud.
Background
Every day, millions if not billions of credit card transactions are processed.
Banks must verify that each transaction is genuine, as there are constant attempts by bad actors to make fraudulent transactions.
Scenario and Goal
Who are you?
- You are a data scientist working for a banking institution that issues credit cards to its customers.
What is your task?
- You are tasked with creating an automated fraud detector. As soon as a credit card transaction is made, given the information available at the time of the transaction (location, amount, etc), your model should immediately identify the transaction as fraudulent or genuine. Your goal is to find a model that appropriately balances false positives and false negatives.
Who are you writing for?
- To summarize your work, you will write a report for your manager, who is the head of the loss minimization team. You can assume your manager is very familiar with banking and credit cards, and reasonably familiar with the general concepts of machine learning.
Data
To achieve the goal of this lab, we will need information on previous credit card transactions, including whether or not they were fraudulent. The necessary data is provided below.
Source
The data for this lab originally comes from Kaggle. Citations for the data can be found on Kaggle.
A brief description of the target variable is given.
This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.
Similarly, a brief description of the feature variables is given.
It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are ‘Time’ and ‘Amount’. Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction Amount; this feature can be used for example-dependent cost-sensitive learning. Feature ‘Class’ is the response variable and it takes value 1 in case of fraud and 0 otherwise.
We are providing a modified version of this data for this lab.
Modifications include:
- Removed the `Time` variable as it is misleading.
- Reduced the number of samples, while maintaining the number of fraudulent transactions.
  - The class imbalance is reduced, but the target is still highly imbalanced.
- Withheld some data that will be considered the production data.
- Renamed the target variable from `Class` to `Fraud`.
- Renamed the PCA transformed variables.
Data Dictionary
Each observation in the train, test, and (hidden) production data contains information about a particular credit card transaction.
Response

- `Fraud` [int64]: status of the transaction. `1` indicates a fraudulent transaction and `0` indicates not fraud, a genuine transaction.

Features

- `Amount` [float64]: amount (in dollars) of the transaction.
- `PC01` - `PC28` [float64]: the 28 principal components that encode information such as location and type of purchase while preserving customer privacy.
Data in Python
To load the data in Python, use:

```python
import pandas as pd

fraud = pd.read_parquet("https://cs307.org/lab-07/data/fraud.parquet")
```
Prepare Data for Machine Learning
First, train-test split the data by using:
from sklearn.model_selection import train_test_split
= train_test_split(
fraud_train, fraud_test
fraud,=0.20,
test_size=42,
random_state=fraud["Fraud"],
stratify )
Then, to create the `X` and `y` variants of the data, use:

```python
# create X and y for train
X_train = fraud_train.drop("Fraud", axis=1)
y_train = fraud_train["Fraud"]

# create X and y for test
X_test = fraud_test.drop("Fraud", axis=1)
y_test = fraud_test["Fraud"]
```
You can assume that within the autograder, similar processing is performed on the production data.
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
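As one possible starting point, the class imbalance and the scale of the transaction amounts are natural things to summarize. The sketch below uses a tiny hypothetical stand-in for `fraud_train` (the real data comes from the parquet file above); the specific numbers are illustrative only.

```python
import pandas as pd

# hypothetical miniature stand-in for fraud_train;
# in the lab, use the data loaded from the parquet file above
fraud_train = pd.DataFrame({
    "Amount": [10.0, 250.0, 5.5, 999.0, 42.0],
    "Fraud": [0, 0, 0, 1, 0],
})

# proportion of fraudulent transactions (the mean of a 0/1 column)
fraud_rate = fraud_train["Fraud"].mean()

# summary statistics for the transaction amounts
amount_stats = fraud_train["Amount"].describe()

print(f"Fraud rate: {fraud_rate:.2%}")
print(amount_stats)
```

A bar chart of the class counts, or a histogram of `Amount` split by class, would be reasonable choices for the report visualization.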
Models
For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:
- Models must start from the given training data, unmodified.
  - Importantly, the types and shapes of `X_train` and `y_train` should not be changed.
- In the autograder, we will call `mod.predict(X_test)` on your model, where your model is loaded as `mod` and `X_test` has a compatible shape with, and the same variable names and types as, `X_train`.
- In the autograder, we will call `mod.predict(X_prod)` on your model, where your model is loaded as `mod` and `X_prod` has a compatible shape with, and the same variable names and types as, `X_train`.
- We assume that you will use a `Pipeline` and `GridSearchCV` from `sklearn`, as you will need to deal with heterogeneous data, and you should be using cross-validation.
  - So more specifically, you should create a `Pipeline` that is fit with `GridSearchCV`. Done correctly, this will store a "model" that you can submit to the autograder.
- Your model must have a `fit` method.
- Your model must have a `predict` method.
- Your model must have a `predict_proba` method.
- Your model should be created with `scikit-learn` version `1.5.2` or newer.
- Your model should be serialized with `joblib` version `1.4.2` or newer.
- Your serialized model must be less than 5MB.
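A minimal sketch of the `Pipeline` plus `GridSearchCV` pattern is shown below. The scaler, the logistic regression estimator, the `class_weight="balanced"` choice, and the grid values are illustrative assumptions, not requirements; the synthetic `X_train` and `y_train` stand in for the real split created earlier.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for X_train / y_train;
# in the lab, use the result of the train-test split above
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 5))
y_train = (rng.random(200) < 0.1).astype(int)  # imbalanced 0/1 target

# a Pipeline fit with GridSearchCV, as the rules describe
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

grid = {"clf__C": [0.1, 1.0, 10.0]}

mod = GridSearchCV(pipe, grid, cv=5, scoring="recall")
mod.fit(X_train, y_train)

# the fitted search object exposes fit, predict, and predict_proba
print(mod.predict(X_train[:3]))
print(mod.predict_proba(X_train[:3]).shape)
```

Serializing the fitted `GridSearchCV` object (here `mod`) is one way to produce a single submittable "model" that satisfies the method requirements above.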
To obtain the maximum points via the autograder, your model performance must meet or exceed:
- Test Precision: 0.7
- Test Recall: 0.83
- Production Precision: 0.7
- Production Recall: 0.83
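You can check these metrics yourself before submitting. The sketch below uses hypothetical labels and predictions; in the lab, substitute `y_test` and `mod.predict(X_test)`.

```python
from sklearn.metrics import precision_score, recall_score

# hypothetical labels and predictions, for illustration only
y_test = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1, 1, 0]

test_precision = precision_score(y_test, y_pred)
test_recall = recall_score(y_test, y_pred)

print(f"Test Precision: {test_precision:.2f}")  # 3 of 4 predicted frauds are real
print(f"Test Recall: {test_recall:.2f}")        # 3 of 4 real frauds are caught
```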
Model Persistence
To save your model for submission to the autograder, use the `dump` function from the `joblib` library. Check PrairieLearn for the filename that the autograder expects for this lab.
```python
from joblib import dump

dump(mod, "filename.joblib")
```
Discussion
As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real-world scenario described at the start of the lab! Justify your conclusion. If you trained multiple models that are mentioned in your report, first make clear which model you selected and are considering for use in practice.
Additional discussion prompts:
- Which is more important, precision or recall? (The performance cutoffs above suggest an answer to this question, but you do not necessarily need to agree with it.)
When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!
Template Notebook
Submission
On Canvas, be sure to submit both your source `.ipynb` file and a rendered `.html` version of the report.