Lab 03: Pitch Classification

Scenario: Suppose you work for Major League Baseball (MLB) as part of the broadcast operations team. You are tasked with automatically displaying the pitch type for each pitch, in real time, both in the stadium and on the television broadcast. You have access to data on every previous pitch thrown in MLB, including characteristics of the pitch such as its velocity and rotation, as well as the type of pitch thrown. Additionally, tracking technology in each stadium will provide data on (at least) the speed and rotation of each pitch, in real time.


The goal of this lab is to create a classification model that predicts the pitch type of a pitch thrown by a particular pitcher given the pitch’s velocity, rotation, movement, and position of the batter.


This lab will use data from Statcast. The response variable is:

  • pitch_name

The features are:

  • release_speed
  • release_spin_rate
  • pfx_x
  • pfx_z
  • stand

Data in Python

To load the data in Python, use:

import pandas as pd
# fill in the path to the training data before running
pitches_train = pd.read_csv("")

To create the X (features) and y (response) versions of the training data, use:

# create X and y for train
X_train = pitches_train.drop("pitch_name", axis=1)
y_train = pitches_train["pitch_name"]

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.

Consider using seaborn to create a pairplot or jointplot to visualize this data.


For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:

  • Models must start from the given training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with, and the same variable names and types as, X_train.
    • We assume that you will use a Pipeline and GridSearchCV from sklearn, since you will need to handle heterogeneous data, and you should be using cross-validation.
      • So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
  • Your model must have a fit method.
  • Your model must have a predict method.
  • Your model must have a predict_proba method.
  • Your model should be created with scikit-learn version 1.4.0 or newer.
  • Your model should be serialized with joblib version 1.3.2 or newer.
    • Your serialized model must be less than 5MB.
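The rules above can be satisfied with a ColumnTransformer inside a Pipeline, tuned with GridSearchCV. The sketch below uses synthetic stand-in data, and the k-nearest-neighbors classifier and its parameter grid are illustrative choices, not the required model; in the lab, fit on the real X_train and y_train:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; in the lab, use the X_train and y_train
# created from pitches_train instead.
X_train = pd.DataFrame({
    "release_speed": [95.1, 84.2, 96.0, 83.5, 94.8, 85.0] * 5,
    "release_spin_rate": [2200, 2500, 2250, 2480, 2190, 2510] * 5,
    "pfx_x": [-0.8, 0.5, -0.7, 0.6, -0.9, 0.4] * 5,
    "pfx_z": [1.4, 0.2, 1.5, 0.1, 1.3, 0.3] * 5,
    "stand": ["R", "L", "R", "L", "L", "R"] * 5,
})
y_train = pd.Series(["4-Seam Fastball", "Slider"] * 15)

# Handle the heterogeneous data: scale the numeric features and
# one-hot encode the categorical one.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(),
     ["release_speed", "release_spin_rate", "pfx_x", "pfx_z"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["stand"]),
])

# A Pipeline fit with GridSearchCV: the fitted search object has fit,
# predict, and predict_proba methods, so it is a "model" that can be
# serialized and submitted to the autograder.
pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("knn", KNeighborsClassifier()),
])
mod = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5]}, cv=5)
mod.fit(X_train, y_train)
print(mod.predict(X_train.iloc[[0]]))  # ['4-Seam Fastball']
```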

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the check in the autograder.

To obtain the maximum points via the autograder, your model performance must meet or exceed:

Test Accuracy: 0.89

Model Persistence

To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects.
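For example (the filename here is a placeholder, and the tiny nearest-neighbor model stands in for your tuned pipeline):

```python
import joblib
from sklearn.neighbors import KNeighborsClassifier

# Stand-in model; in the lab, dump the GridSearchCV-fitted pipeline.
mod = KNeighborsClassifier(n_neighbors=1).fit([[0.0], [1.0]], [0, 1])

# Serialize to disk; replace the placeholder filename with the one
# the autograder expects (check PrairieLearn).
joblib.dump(mod, "model.joblib")

# Sanity check: reload the file and verify the model still predicts.
reloaded = joblib.load("model.joblib")
print(reloaded.predict([[0.0]]))  # [0]
```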


As always, be sure to state a conclusion: whether or not you would use the model you trained and selected for the real-world scenario described at the start of the lab! If you are asked to train multiple models, first make clear which model you selected and are considering for use in practice. Discuss any limitations and potential improvements.

Additional discussion topics:

  • Here, we’ve only considered one pitcher. How would this system work given that there are hundreds of pitchers in MLB and several used in one game? Can we create a model for any pitcher that could pitch in a game? (Hint: Every pitcher has to throw their first MLB pitch someday!)
  • We’re using 2022 for training data and 2023 as test data, which is reasonable, as predictions will need to be made for future pitches. However, does this create potential issues? (Hint: Pitchers sometimes learn new pitches! Pitchers also change how they throw pitches they already know during the off-season. This may explain why your test accuracy will likely be a bit lower than a cross-validated accuracy.)
  • Do you think your classifier predicts fast enough to work in real-time?

When answering discussion prompts: do not simply restate the prompt and respond to it. Address the prompt, but write as if the prompt did not exist; your report should make sense to a reader who does not have access to this document!

Template Notebook


Before submission, especially of your report, you should be sure to review the Lab Policy page!

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.