Lab 03: Pitch Classification

For Lab 03, you will use baseball pitch data to develop a model that will predict a pitch’s “type” based on its characteristics.

It’s tough to make predictions, especially about the future.

Yogi Berra

Background

What is a pitch type, you might ask? Well, it’s complicated. Let’s allow someone else to explain:

As we’ll see in a moment, while the pitch type is technically defined by what the pitcher claims they threw, we can probably infer the pitch type based on only speed and spin.

Now that you’re a pitch type expert, here’s a game to see how well you can identify pitches from video:

That game was difficult, wasn’t it? Don’t feel bad! Identifying pitches with your eyes is difficult. It is even more difficult when you realize the cameras are playing tricks on you:

But wait! Then how do television broadcasts of baseball games instantly display the pitch type? You guessed it… machine learning! For a deep dive on how they do this, see here:

The long story short is:

  • Have advanced tracking technology that can instantly record speed, spin, and other measurements for each pitch.
  • Have a trained classifier for pitch type based on speed, spin, and more.
  • In real time, make a prediction of the pitch type as soon as the speed and spin are recorded.
  • Display the result in the stadium and on the broadcast!

There are several ways to accomplish this task, but one decision we’ll make is to build one model per pitcher. For the purpose of this lab, we will only model the pitches for a single pitcher, Shohei Ohtani.

Shohei Ohtani is a pitcher who pitched for the Los Angeles Angels in both the 2022 and 2023 seasons.

Ohtani currently plays for the 2024 Los Angeles Dodgers, but due to injury is not currently pitching. Instead, on September 19, 2024, he became the first player in MLB history to achieve a “50-50” season, that is, 50 home runs and 50 stolen bases. Yes, in addition to being an elite pitcher, he is also setting records as a batter, which is truly unprecedented in baseball.

Scenario and Goal

Suppose you work for Major League Baseball (MLB) as part of the broadcast operations team. You are tasked with automatically displaying the pitch type for each pitch in real time, both in the stadium and on the television broadcast. You have access to data on every previous pitch thrown in MLB, including characteristics of the pitch such as its velocity and rotation, as well as the type of pitch thrown. Additionally, tracking technology in each stadium will provide data on (at least) the speed and rotation of each pitch, in real time.

Your goal is to create a classification model that predicts the pitch type of a pitch thrown by a particular pitcher given the pitch’s velocity, rotation, and movement, and the side of the plate on which the batter stands.

Data

To achieve the goal of this lab, we will need historical pitching data. The necessary data is provided in the following files:

Source

The original source of the data is Statcast. Specifically, the pybaseball package was used to interface with Statcast Search, which is part of Baseball Savant.

Data Dictionary

Each observation contains information about a single pitch thrown by Shohei Ohtani in either 2022 (train data) or the first half of 2023 (test data) during an MLB regular season game.

Here, the train-test split is based on time.

  • Train: 2022 MLB Season
  • Test: (First Half of) 2023 MLB Season

Original and (mostly) complete documentation for Statcast data can be found in the Statcast Search CSV Documentation.

Response

pitch_name

  • [object] the name of the pitch type thrown

Features

release_speed

  • [float64] pitch velocity (miles per hour) measured shortly after leaving the pitcher’s hand

release_spin_rate

  • [float64] pitch spin rate (revolutions per minute) measured shortly after leaving the pitcher’s hand

pfx_x

  • [float64] horizontal movement (feet) of the pitch from the catcher’s perspective.

pfx_z

  • [float64] vertical movement (feet) of the pitch from the catcher’s perspective.

stand

  • [object] side of the plate on which the batter is standing, either L (left) or R (right)

Data in Python

To load the data in Python, use:

import pandas as pd
pitches_train = pd.read_parquet(
    "https://cs307.org/lab-03/data/pitches-train.parquet"
)
pitches_test = pd.read_parquet(
    "https://cs307.org/lab-03/data/pitches-test.parquet"
)
pitches_train

           pitch_name  release_speed  release_spin_rate  pfx_x  pfx_z  stand
0             Sweeper           84.7             2667.0   1.25   0.01      R
1             Sweeper           83.9             2634.0   1.41   0.20      R
2             Sweeper           84.4             2526.0   1.26   0.25      R
3           Curveball           74.3             2389.0   0.93  -1.10      L
4             Sweeper           85.6             2474.0   1.08   0.52      R
...               ...            ...                ...    ...    ...    ...
2623     Split-Finger           91.8             1314.0  -0.30   0.08      R
2624          Sweeper           86.9             2440.0   1.11   0.51      R
2625  4-Seam Fastball           99.2             2320.0   0.04   0.81      R
2626  4-Seam Fastball           97.9             2164.0   0.08   1.06      R
2627  4-Seam Fastball           99.8             2182.0  -0.19   1.07      R

2628 rows × 6 columns

Prepare Data for Machine Learning

Create the X and y variants of the data for use with sklearn:

# create X and y for train
X_train = pitches_train.drop("pitch_name", axis=1)
y_train = pitches_train["pitch_name"]

# create X and y for test
X_test = pitches_test.drop("pitch_name", axis=1)
y_test = pitches_test["pitch_name"]

You can assume that within the autograder, similar processing is performed on the production data.
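
Because the autograder expects the same variable names and types as X_train, it can help to confirm that the split left the features untouched. The following is only a minimal sketch: pitches_demo is a hypothetical three-row stand-in built from values in the preview above, not the real training data.

```python
import pandas as pd

# Hypothetical stand-in for pitches_train, using three rows from the preview above.
pitches_demo = pd.DataFrame(
    {
        "pitch_name": ["Sweeper", "Curveball", "4-Seam Fastball"],
        "release_speed": [84.7, 74.3, 99.2],
        "release_spin_rate": [2667.0, 2389.0, 2320.0],
        "pfx_x": [1.25, 0.93, 0.04],
        "pfx_z": [0.01, -1.10, 0.81],
        "stand": ["R", "L", "R"],
    }
)

# Same split as in the lab: features in X, response in y.
X_demo = pitches_demo.drop("pitch_name", axis=1)
y_demo = pitches_demo["pitch_name"]

# Inspect names, shapes, and types before modeling.
feature_names = list(X_demo.columns)
print(feature_names)
print(X_demo.shape, y_demo.shape)
print(X_demo.dtypes)
```

Running the same three print statements on X_train and X_test should show identical column names and types for both.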

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.

Consider using seaborn to create a pairplot or jointplot to visualize this data.
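
As a starting point for the summary statistics, here is a hedged sketch of two common calculations: the proportion of each pitch type, and the mean and standard deviation of velocity and spin within each pitch type. The statistics actually requested on PrairieLearn may differ, and pitches_demo is a hypothetical six-row stand-in built from rows in the preview above.

```python
import pandas as pd

# Hypothetical stand-in for pitches_train (six rows taken from the preview above).
pitches_demo = pd.DataFrame(
    {
        "pitch_name": [
            "Sweeper", "Sweeper", "Sweeper",
            "Curveball", "4-Seam Fastball", "4-Seam Fastball",
        ],
        "release_speed": [84.7, 83.9, 84.4, 74.3, 99.2, 97.9],
        "release_spin_rate": [2667.0, 2634.0, 2526.0, 2389.0, 2320.0, 2164.0],
    }
)

# Proportion of pitches of each type (the "pitch mix").
pitch_mix = pitches_demo["pitch_name"].value_counts(normalize=True)

# Mean and standard deviation of velocity and spin within each pitch type.
by_pitch = pitches_demo.groupby("pitch_name")[
    ["release_speed", "release_spin_rate"]
].agg(["mean", "std"])

print(pitch_mix)
print(by_pitch)
```

For the visualization, one option is seaborn’s pairplot with hue="pitch_name", which colors each point by pitch type.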

Models

For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:

  • Models must start from the given training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) and mod.predict(X_prod) on your model, where your model is loaded as mod and both X_test and X_prod have a shape compatible with X_train and the same variable names and types as X_train.
    • We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation.
      • So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
  • Your model must have a fit method.
  • Your model must have a predict method.
  • Your model must have a predict_proba method.
  • Your model should be created with scikit-learn version 1.5.2 or newer.
  • Your model should be serialized with joblib version 1.4.2 or newer.
    • Your serialized model must be less than 5MB.
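
To make the Pipeline and GridSearchCV rule concrete, here is one possible shape such a model could take. This is a sketch under stated assumptions, not the intended solution: the data is synthetic (two made-up, well-separated pitch types), the choice of a k-nearest neighbors classifier and its tuning grid are illustrative only, and the sketch assumes no missing values in the features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data with the same column names as X_train.
rng = np.random.default_rng(307)
n = 40  # pitches per synthetic pitch type
X_demo = pd.DataFrame(
    {
        "release_speed": np.concatenate([rng.normal(98, 1, n), rng.normal(85, 1, n)]),
        "release_spin_rate": np.concatenate([rng.normal(2250, 80, n), rng.normal(2550, 80, n)]),
        "pfx_x": np.concatenate([rng.normal(0.0, 0.15, n), rng.normal(1.2, 0.15, n)]),
        "pfx_z": np.concatenate([rng.normal(1.0, 0.15, n), rng.normal(0.3, 0.15, n)]),
        "stand": rng.choice(["L", "R"], size=2 * n),
    }
)
y_demo = pd.Series(["4-Seam Fastball"] * n + ["Sweeper"] * n, name="pitch_name")

numeric_features = ["release_speed", "release_spin_rate", "pfx_x", "pfx_z"]
categorical_features = ["stand"]

# Scale numeric columns and one-hot encode the batter's side of the plate.
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# Preprocessing and the classifier live in one Pipeline, so cross-validation
# refits the preprocessing within each fold.
pipeline = Pipeline(
    [
        ("preprocess", preprocessor),
        ("classify", KNeighborsClassifier()),
    ]
)

# Illustrative grid only; the values to search are a modeling decision.
param_grid = {"classify__n_neighbors": [1, 5, 15]}
mod = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
mod.fit(X_demo, y_demo)

print(mod.best_params_)
print(mod.predict_proba(X_demo.head(3)))
```

The fitted GridSearchCV object refits the best pipeline on all of the data it was given and exposes fit, predict, and predict_proba, so it can be serialized as a single object that accepts a raw data frame.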

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.

To obtain the maximum points via the autograder, your model performance must meet or exceed:

Test Accuracy: 0.935
Production Accuracy: 0.935
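
Accuracy here is the proportion of pitches whose predicted type matches the recorded type. The tiny example below, with made-up labels, is only meant to illustrate the metric and the comparison against the 0.935 threshold; the exact computation inside the autograder is not shown in this document.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: three of the four predictions are correct.
y_true = np.array(["Sweeper", "Sweeper", "Curveball", "4-Seam Fastball"])
y_pred = np.array(["Sweeper", "Slider", "Curveball", "4-Seam Fastball"])

demo_accuracy = accuracy_score(y_true, y_pred)
meets_threshold = demo_accuracy >= 0.935

print(demo_accuracy)
print(meets_threshold)

# With the lab objects, the analogous call would be:
# accuracy_score(y_test, mod.predict(X_test))
```

Test accuracy can be checked locally this way. The production data is presumably available only inside the autograder, so production accuracy cannot be computed ahead of submission.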

Model Persistence

To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects for this lab.

from joblib import dump
dump(mod, "filename.joblib")
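
Before submitting, it may be worth checking that the saved file reloads correctly and stays under the 5MB limit. The sketch below uses a trivial placeholder model and a hypothetical filename in a temporary directory. It also assumes a megabyte means 10^6 bytes; the autograder's exact convention is not stated here.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.dummy import DummyClassifier

# Placeholder model for illustration; in the lab this would be your fitted mod.
mod_demo = DummyClassifier(strategy="most_frequent").fit(
    [[0], [1], [2]], ["Sweeper", "Sweeper", "Curveball"]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.joblib")  # hypothetical filename
    dump(mod_demo, path)

    # File size in megabytes (assuming 1 MB = 10^6 bytes).
    size_mb = os.path.getsize(path) / 1_000_000

    # Reload to confirm that the serialized model still predicts.
    reloaded = load(path)

print(f"{size_mb:.4f} MB")
print(reloaded.predict([[5]]))
```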

Discussion

As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab! Justify your conclusion. If you trained multiple models that are mentioned in your report, first make clear which model you selected and are considering for use in practice.

Additional discussion prompts:

  • Here, we’ve only considered one pitcher. How would this system work given that there are hundreds of pitchers in MLB and several used in one game? Can we create a model for any pitcher that could pitch in a game? (Hint: Every pitcher has to throw their first MLB pitch someday!)
  • We’re using 2022 for training data and part of 2023 as test data, which is reasonable, as predictions will need to be made for future pitches. However, does this create potential issues? (Hint: Pitchers sometimes learn new pitches! Pitchers also change how they throw pitches they already know how to throw during the off-season. This will explain why your test accuracy will likely be a bit lower than a cross-validated accuracy.)
  • Do you think your classifier predicts fast enough to work in real-time?
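
For the last prompt, one rough way to gather evidence is to time predictions one pitch at a time, since a broadcast system would see pitches individually rather than in batches. The sketch below is an assumption-laden illustration: it times a placeholder k-nearest neighbors model fit to random numbers, so the number it prints says nothing about your own model or about the rest of a real broadcast system.

```python
import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder model fit to random data, for illustration only.
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(2000, 4))
y_fit = rng.integers(0, 5, size=2000)
mod_demo = KNeighborsClassifier(n_neighbors=5).fit(X_fit, y_fit)

# Time many single-pitch predictions and report the average.
one_pitch = X_fit[:1]
n_calls = 200
start = time.perf_counter()
for _ in range(n_calls):
    mod_demo.predict(one_pitch)
elapsed = time.perf_counter() - start

ms_per_pitch = 1000 * elapsed / n_calls
print(f"{ms_per_pitch:.3f} ms per single-pitch prediction")
```

Replacing mod_demo with your own mod and one_pitch with X_test.head(1) gives a comparable figure for your model on your machine.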

When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!

Template Notebook

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy page.

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.