import pandas as pd
Lab 03: Pitch Classification
For Lab 03, you will use baseball pitch data to develop a model that will predict a pitch’s “type” based on its characteristics.
It’s tough to make predictions, especially about the future.
Background
What is a pitch type, you might ask? Well, it’s complicated. Let’s allow someone else to explain:
As we’ll see in a moment, while the pitch type is technically defined by what the pitcher claims they threw, we can probably infer the pitch type based on only speed and spin.
Now that you’re a pitch type expert, here’s a game to see how well you can identify pitches from video:
That game was difficult, wasn’t it? Don’t feel bad! Identifying pitches with your eyes is difficult. It is even more difficult when you realize the cameras are playing tricks on you:
But wait! Then how do television broadcasts of baseball games instantly display the pitch type? You guessed it… machine learning! For a deep dive on how they do this, see here:
The long story short is:
- Have advanced tracking technology that can instantly record speed, spin, and other measurements for each pitch.
- Have a trained classifier for pitch type based on speed, spin, and more.
- In real time, make a prediction of the pitch type as soon as the speed and spin are recorded.
- Display the result in the stadium and on the broadcast!
There are several ways to accomplish this task, but one decision we’ll make is to make one-model-per-pitcher. For the purpose of this lab, we will only model the pitches for a single pitcher, Shohei Ohtani.
Shohei Ohtani is a pitcher who pitched for the Los Angeles Angels in both the 2022 and 2023 seasons.
Ohtani currently plays for the 2024 Los Angeles Dodgers, but due to injury is not currently pitching. Instead, on September 19, 2024, he became the first player in MLB history to achieve a “50-50” season, that is, 50 home runs and 50 stolen bases. Yes, in addition to being an elite pitcher, he is also setting records as a batter, which is truly unprecedented in baseball.
Scenario and Goal
Suppose you work for Major League Baseball (MLB) as part of the broadcast operations team. You are tasked with automatically displaying the pitch type for each pitch in real-time, both in the stadium, and on the television broadcast. You have access to data on every previous pitch thrown in MLB, including characteristics of the pitch such as its velocity and rotation, as well as the type of pitch thrown. Additionally, tracking technology in each stadium will provide data on (at least) the speed and rotation of each pitch, in real-time.
Your goal is to create a classification model that predicts the pitch type of a pitch thrown by a particular pitcher given the pitch’s velocity, rotation, movement, and position of the batter.
Data
To achieve the goal of this lab, we will need historical pitching data. The necessary data is provided in the following files:
- pitches-train.parquet
- pitches-test.parquet
Source
The original source of the data is Statcast. Specifically, the pybaseball package was used to interface with Statcast Search, which is part of Baseball Savant.
Data Dictionary
Each observation contains information about a single pitch thrown by Shohei Ohtani in either 2022 (train data) or 2023 (test data) during an MLB regular season game.
Here, the train-test split is based on time.
- Train: 2022 MLB Season
- Test: (First Half of) 2023 MLB Season
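A time-based split like the one above can be sketched as follows. This is only an illustration: it assumes a raw export that still carries a `game_date` column (the lab's parquet files do not include one), and the four rows are made up.

```python
import pandas as pd

# Toy stand-in for raw pitch data; these values are invented for illustration.
raw = pd.DataFrame(
    {
        "game_date": pd.to_datetime(
            ["2022-04-07", "2022-09-29", "2023-03-30", "2023-06-27"]
        ),
        "pitch_name": ["Sweeper", "4-Seam Fastball", "Sweeper", "Split-Finger"],
        "release_speed": [84.7, 99.2, 85.1, 91.8],
    }
)

# Split on the calendar year so every test pitch is later than every training pitch.
is_2022 = raw["game_date"].dt.year == 2022
train = raw[is_2022].drop(columns="game_date")
test = raw[~is_2022].drop(columns="game_date")

print(len(train), len(test))
```

Splitting by date rather than at random mimics how the model will be used: it is trained on past pitches and asked about future ones.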
Original and (mostly) complete documentation for Statcast data can be found in the Statcast Search CSV Documentation.
Response
pitch_name
[object]
the name of the pitch, which is the name of the pitch type thrown
Features
release_speed
[float64]
pitch velocity (miles per hour) measured shortly after leaving the pitcher’s hand
release_spin_rate
[float64]
pitch spin rate (revolutions per minute) measured shortly after leaving the pitcher’s hand
pfx_x
[float64]
horizontal movement (feet) of the pitch from the catcher’s perspective.
pfx_z
[float64]
vertical movement (feet) of the pitch from the catcher’s perspective.
stand
[object]
side of the plate on which the batter stands, either L (left) or R (right)
Data in Python
To load the data in Python, use:
pitches_train = pd.read_parquet(
    "https://cs307.org/lab-03/data/pitches-train.parquet"
)
pitches_test = pd.read_parquet(
    "https://cs307.org/lab-03/data/pitches-test.parquet"
)
pitches_train
| | pitch_name | release_speed | release_spin_rate | pfx_x | pfx_z | stand |
---|---|---|---|---|---|---|
0 | Sweeper | 84.7 | 2667.0 | 1.25 | 0.01 | R |
1 | Sweeper | 83.9 | 2634.0 | 1.41 | 0.20 | R |
2 | Sweeper | 84.4 | 2526.0 | 1.26 | 0.25 | R |
3 | Curveball | 74.3 | 2389.0 | 0.93 | -1.10 | L |
4 | Sweeper | 85.6 | 2474.0 | 1.08 | 0.52 | R |
... | ... | ... | ... | ... | ... | ... |
2623 | Split-Finger | 91.8 | 1314.0 | -0.30 | 0.08 | R |
2624 | Sweeper | 86.9 | 2440.0 | 1.11 | 0.51 | R |
2625 | 4-Seam Fastball | 99.2 | 2320.0 | 0.04 | 0.81 | R |
2626 | 4-Seam Fastball | 97.9 | 2164.0 | 0.08 | 1.06 | R |
2627 | 4-Seam Fastball | 99.8 | 2182.0 | -0.19 | 1.07 | R |
2628 rows × 6 columns
Prepare Data for Machine Learning
Create the X and y variants of the data for use with sklearn:
# create X and y for train
X_train = pitches_train.drop("pitch_name", axis=1)
y_train = pitches_train["pitch_name"]

# create X and y for test
X_test = pitches_test.drop("pitch_name", axis=1)
y_test = pitches_test["pitch_name"]
You can assume that within the autograder, similar processing is performed on the production data.
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
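As a starting point, the sketch below computes two common summaries: the share of each pitch type, and the per-pitch-type mean and standard deviation of speed and spin. The statistics actually requested on PrairieLearn may differ. To stay self-contained, it uses only the first five rows shown in the training-data preview above; in practice you would run the same calls on the full `pitches_train`.

```python
import pandas as pd

# Five rows copied from the training-data preview; use the full pitches_train in practice.
sample = pd.DataFrame(
    {
        "pitch_name": ["Sweeper", "Sweeper", "Sweeper", "Curveball", "Sweeper"],
        "release_speed": [84.7, 83.9, 84.4, 74.3, 85.6],
        "release_spin_rate": [2667.0, 2634.0, 2526.0, 2389.0, 2474.0],
        "pfx_x": [1.25, 1.41, 1.26, 0.93, 1.08],
        "pfx_z": [0.01, 0.20, 0.25, -1.10, 0.52],
        "stand": ["R", "R", "R", "L", "R"],
    }
)

# Proportion of each pitch type (class balance matters when judging accuracy).
pitch_mix = sample["pitch_name"].value_counts(normalize=True)

# Mean and standard deviation of speed and spin within each pitch type.
by_pitch = sample.groupby("pitch_name")[["release_speed", "release_spin_rate"]].agg(
    ["mean", "std"]
)

print(pitch_mix)
print(by_pitch)
```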
Models
For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:
- Models must start from the given training data, unmodified.
  - Importantly, the types and shapes of X_train and y_train should not be changed.
  - In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a compatible shape with and the same variable names and types as X_train.
  - In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a compatible shape with and the same variable names and types as X_train.
  - We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation.
    - So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
- Your model must have a fit method.
- Your model must have a predict method.
- Your model must have a predict_proba method.
- Your model should be created with scikit-learn version 1.5.2 or newer.
- Your model should be serialized with joblib version 1.4.2 or newer.
  - Your serialized model must be less than 5MB.
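One possible shape for a Pipeline fit with GridSearchCV is sketched below. It is not the expected solution. The k-nearest-neighbors classifier, the preprocessing steps, and the parameter grid are arbitrary choices for illustration, and the data is synthetic (made-up values under the lab's column names) so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for X_train / y_train with the lab's column names (values are made up).
rng = np.random.default_rng(307)
n = 60
y_toy = pd.Series(np.repeat(["4-Seam Fastball", "Sweeper"], n // 2), name="pitch_name")
is_fastball = (y_toy == "4-Seam Fastball").to_numpy()
X_toy = pd.DataFrame(
    {
        "release_speed": np.where(is_fastball, 98.0, 85.0) + rng.normal(0, 1, n),
        "release_spin_rate": np.where(is_fastball, 2200.0, 2500.0) + rng.normal(0, 50, n),
        "pfx_x": np.where(is_fastball, 0.0, 1.2) + rng.normal(0, 0.1, n),
        "pfx_z": np.where(is_fastball, 1.0, 0.2) + rng.normal(0, 0.1, n),
        "stand": rng.choice(["L", "R"], size=n),
    }
)

numeric_features = ["release_speed", "release_spin_rate", "pfx_x", "pfx_z"]
categorical_features = ["stand"]

# Scale numeric columns and one-hot encode the batter's side inside the pipeline,
# so the raw, unmodified data frame can be passed to fit and predict.
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)
pipeline = Pipeline(
    [("preprocess", preprocessor), ("classify", KNeighborsClassifier())]
)

# Cross-validate over a small, arbitrary grid; the fitted search object is the "model".
param_grid = {"classify__n_neighbors": [1, 5, 15]}
mod = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
mod.fit(X_toy, y_toy)

print(mod.best_params_)
print(mod.predict(X_toy.head(3)))
print(mod.predict_proba(X_toy.head(3)).shape)
```

Because the preprocessing lives inside the pipeline, the fitted `GridSearchCV` object accepts a data frame with the same columns as `X_train`. By default it refits the best pipeline on all of the training data, and it exposes `fit`, `predict`, and `predict_proba` as long as the final estimator does.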
To obtain the maximum points via the autograder, your model performance must meet or exceed:
Test Accuracy: 0.935
Production Accuracy: 0.935
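Accuracy here is the fraction of pitches whose predicted type matches the recorded type. The sketch below shows the calculation on five made-up labels; with your own model you would compare `y_test` against `mod.predict(X_test)` instead.

```python
from sklearn.metrics import accuracy_score

# Made-up labels and predictions for five pitches, purely for illustration.
y_true = ["Sweeper", "Sweeper", "Curveball", "4-Seam Fastball", "Split-Finger"]
y_pred = ["Sweeper", "Sweeper", "Sweeper", "4-Seam Fastball", "Split-Finger"]

# Accuracy is the fraction of pitches whose predicted type matches the recorded type.
accuracy = accuracy_score(y_true, y_pred)

# "Meet or exceed" translates to a greater-than-or-equal comparison.
meets_threshold = accuracy >= 0.935

print(accuracy, meets_threshold)
```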
Model Persistence
To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects for this lab.
from joblib import dump

dump(mod, "filename.joblib")
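Before submitting, it can help to confirm that the saved file reloads and fits under the size limit. The sketch below uses a trivial placeholder model and a temporary directory so that it runs on its own; substitute your fitted model and the filename from PrairieLearn. Whether the autograder counts a megabyte as 10^6 or 2^20 bytes is not stated, so the stricter 10^6 is assumed here.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.dummy import DummyClassifier

# Trivial placeholder model so the example is self-contained; substitute your fitted model.
mod = DummyClassifier(strategy="most_frequent").fit(
    [[0.0], [1.0]], ["Sweeper", "Sweeper"]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "filename.joblib")  # placeholder filename
    dump(mod, path)

    # Size on disk in megabytes (assuming 1 MB = 10^6 bytes), to compare with the 5MB limit.
    size_mb = os.path.getsize(path) / 1e6

    # Reload to confirm the saved object still predicts.
    reloaded = load(path)
    prediction = reloaded.predict([[0.5]])

print(f"{size_mb:.3f} MB", prediction)
```

If the file is too large, `dump` accepts a `compress` argument that trades some save and load time for a smaller file.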
Discussion
As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab! Justify your conclusion. If you trained multiple models that are mentioned in your report, first make clear which model you selected and are considering for use in practice.
Additional discussion prompts:
- Here, we’ve only considered one pitcher. How would this system work given that there are hundreds of pitchers in MLB and several used in one game? Can we create a model for any pitcher that could pitch in a game? (Hint: Every pitcher has to throw their first MLB pitch someday!)
- We’re using 2022 for training data and part of 2023 as test data, which is reasonable, as predictions will need to be made for future pitches. However, does this create potential issues? (Hint: Pitchers sometimes learn new pitches! Pitchers also change how they throw pitches they already know how to throw during the off-season. This will explain why your test accuracy will likely be a bit lower than a cross-validated accuracy.)
- Do you think your classifier predicts fast enough to work in real-time?
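One way to ground the real-time question is to measure how long a single prediction takes. The sketch below times repeated one-pitch predictions for a placeholder model fit on random data; the model, the data, and the repeat count are all assumptions, and the measured latency will depend on your own model and hardware.

```python
import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder model on made-up data; substitute your fitted model and a real pitch.
rng = np.random.default_rng(0)
X_fake = rng.normal(size=(1000, 4))
y_fake = rng.choice(["4-Seam Fastball", "Sweeper"], size=1000)
mod = KNeighborsClassifier().fit(X_fake, y_fake)

one_pitch = X_fake[:1]

# Time many single-pitch predictions and report the average latency.
n_repeats = 100
start = time.perf_counter()
for _ in range(n_repeats):
    mod.predict(one_pitch)
elapsed_ms = (time.perf_counter() - start) / n_repeats * 1000

print(f"average latency: {elapsed_ms:.2f} ms per pitch")
```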
When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!
Template Notebook
Submission
On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.