import pandas as pd
Lab 03: Pitches
For Lab 03, you will use baseball pitch data to develop a model that will predict a pitch’s “type” based on its characteristics.
It’s tough to make predictions, especially about the future.
Background
What is a pitch type, you might ask? Well, it’s complicated.
As we’ll see in a moment, while the pitch type is technically defined by what the pitcher claims they threw, we can probably infer the pitch type based on only speed and spin.
Now that you’re a pitch type expert, here’s a game to see how well you can identify pitches from video:
That game was difficult, wasn’t it? Don’t feel bad! Identifying pitches by eye is hard. It is even harder once you realize the cameras are playing tricks on you:
But wait! Then how do television broadcasts of baseball games instantly display the pitch type? You guessed it… machine learning! For a deep dive on how they do this, see here:
Long story short:
- Have advanced tracking technology that can instantly record speed, spin, and other measurements for each pitch.
- Have a trained classifier for pitch type based on speed, spin, and more.
- In real time, make a prediction of the pitch type as soon as the speed and spin are recorded.
- Display the result in the stadium and on the broadcast!
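The four steps above can be sketched in a few lines. This is only an illustration under loud assumptions: the data is synthetic (two made-up pitch types with invented speed and spin values, not real Statcast measurements), and the scaled nearest-neighbors classifier is just one convenient choice, not the method used by any broadcast system.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for historical tracking data (NOT real Statcast values):
# two made-up pitch types that differ in speed (mph) and spin (rpm).
rng = np.random.default_rng(42)
fastballs = np.column_stack([rng.normal(97, 1, 50), rng.normal(2200, 50, 50)])
curveballs = np.column_stack([rng.normal(78, 1, 50), rng.normal(2600, 50, 50)])
X_hist = np.vstack([fastballs, curveballs])
y_hist = np.array(["Fastball"] * 50 + ["Curveball"] * 50)

# Step 2: train a classifier ahead of time on speed and spin.
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
clf.fit(X_hist, y_hist)

# Steps 1 and 3: a new pitch is tracked, then classified right away.
new_pitch = np.array([[96.5, 2210.0]])  # hypothetical speed and spin readings
predicted_type = clf.predict(new_pitch)[0]

# Step 4: "display" the result.
print(f"Pitch type: {predicted_type}")
```

The expensive part (fitting) happens before the game; only the cheap `predict` call happens after each pitch.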
There are several ways to accomplish this task, but one decision we’ll make up front is to fit one model per pitcher. For the purpose of this lab, we will only model the pitches of a single pitcher, Shohei Ohtani.
Shohei Ohtani pitched for the Los Angeles Angels in both the 2022 and 2023 seasons.
Scenario and Goal
Who are you?
- You are a data scientist working for Major League Baseball (MLB) as part of the broadcast operations team. The current date is July 11, 2023, the date of the 2023 MLB All-Star Game.
What is your task?
- You are tasked with creating (or updating) a model to assist with automatically displaying the pitch type for each pitch in real time, both in the stadium and on the television broadcast. You have access to data on every pitch thrown in MLB to date, including characteristics of the pitch such as its velocity and rotation, as well as the type of pitch thrown. Additionally, tracking technology in each stadium will provide data on (at least) the speed and rotation of each pitch in real time. You are asked to create a proof-of-concept for a single pitcher, Shohei Ohtani, but your process should be designed so that it can easily be applied to other pitchers as well.
Who are you writing for?
- To summarize your work, you will write a report for your manager, who reports to the Vice President of Media Operations. You can assume your manager is very familiar with baseball and associated broadcasting. Given the focus on broadcasting, your manager is especially concerned with whether your model can work in real time; that is, it must make predictions nearly instantaneously after the pitch is thrown.
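Since real-time performance matters to your manager, it may help to measure how long a single prediction takes. The sketch below is one possible way to do that; the data is random noise, the decision tree is an arbitrary stand-in for whatever model you choose, and the number of timed calls (200) is an assumption picked only to smooth out timing jitter.

```python
import time

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Random placeholder data: 1000 "pitches" with 4 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.choice(["A", "B", "C"], size=1000)  # made-up class labels

# Any fitted model with a predict method could be timed the same way.
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# Time many single-row predictions and report the average per call.
one_pitch = X[:1]
n_calls = 200
start = time.perf_counter()
for _ in range(n_calls):
    model.predict(one_pitch)
elapsed = time.perf_counter() - start

latency_ms = 1000 * elapsed / n_calls
print(f"average latency per prediction: {latency_ms:.3f} ms")
```

Timing one row at a time mirrors the broadcast setting, where pitches arrive individually rather than in batches.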
Data
To achieve the goal of this lab, we will need historical pitching data. The necessary data is provided in the following files:
- pitches-train.parquet
- pitches-test.parquet
Source
The original source of the data is Statcast. Specifically, the pybaseball package was used to interface with Statcast Search, which is part of Baseball Savant.
Data Dictionary
Each observation contains information about a single pitch thrown by Shohei Ohtani in either 2022 (train data) or 2023 (test data) during an MLB regular season game.
Here, the train-test split is based on time.
- Train: 2022 MLB Season
- Test: (First Half of) 2023 MLB Season
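A time-based split like the one above can be sketched as follows. The provided files are already split, so this is purely illustrative: the `game_date` column name and every value in the toy frame are assumptions made for the example, not part of the lab data described here.

```python
import pandas as pd

# Toy frame with a hypothetical game_date column and made-up values.
pitches = pd.DataFrame(
    {
        "game_date": pd.to_datetime(
            ["2022-04-07", "2022-09-29", "2023-03-30", "2023-07-04"]
        ),
        "release_speed": [97.1, 98.4, 96.8, 97.9],
        "pitch_name": ["4-Seam Fastball", "Sweeper", "Sweeper", "4-Seam Fastball"],
    }
)

# Earlier season for training, later season for testing: no shuffling,
# so the test data always comes after the training data in time.
train = pitches[pitches["game_date"].dt.year == 2022]
test = pitches[pitches["game_date"].dt.year == 2023]

print(len(train), "train rows;", len(test), "test rows")
```

Splitting by time rather than at random matches how the model will be used: it is trained on the past and asked to predict pitches it has not yet seen.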
Original and (mostly) complete documentation for Statcast data can be found in the Statcast Search CSV Documentation. A more detailed reference can be found in Appendix C of Analyzing Baseball Data with R.[1]
Response
- pitch_name [object]: the name of the pitch, which is the name of the pitch type thrown
Features
- release_speed [float64]: pitch velocity (miles per hour) measured shortly after leaving the pitcher’s hand
- release_spin_rate [float64]: pitch spin rate (revolutions per minute) measured shortly after leaving the pitcher’s hand
- pfx_x [float64]: horizontal movement (feet) of the pitch from the catcher’s perspective
- pfx_z [float64]: vertical movement (feet) of the pitch from the catcher’s perspective
- stand [object]: side of the plate the batter is standing on, either L (left) or R (right)
Data in Python
To load the data in Python, use:
import pandas as pd

# load the train (2022) and test (2023) pitch data
pitches_train = pd.read_parquet(
    "https://cs307.org/lab/data/pitches-train.parquet",
)
pitches_test = pd.read_parquet(
    "https://cs307.org/lab/data/pitches-test.parquet",
)
Prepare Data for Machine Learning
Create the X and y variants of the data for use with sklearn:
# create X and y for train
X_train = pitches_train.drop("pitch_name", axis=1)
y_train = pitches_train["pitch_name"]

# create X and y for test
X_test = pitches_test.drop("pitch_name", axis=1)
y_test = pitches_test["pitch_name"]
You can assume that within the autograder, similar processing is performed on the production data.
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn.
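As a starting point for looking at the data, the sketch below computes pitch-type proportions and per-pitch summaries with pandas. It runs on a tiny made-up frame that only borrows the column names from the data dictionary; the specific statistics requested on PrairieLearn may differ, so treat this as a pattern rather than the required calculation.

```python
import pandas as pd

# Tiny made-up frame using column names from the data dictionary.
toy = pd.DataFrame(
    {
        "pitch_name": ["Fastball", "Fastball", "Slider", "Slider", "Slider"],
        "release_speed": [97.0, 99.0, 85.0, 86.0, 84.0],
        "release_spin_rate": [2200.0, 2300.0, 2500.0, 2450.0, 2550.0],
    }
)

# Proportion of each pitch type (how imbalanced are the classes?).
pitch_share = toy["pitch_name"].value_counts(normalize=True)

# Mean and standard deviation of speed and spin within each pitch type.
by_pitch = toy.groupby("pitch_name")[["release_speed", "release_spin_rate"]].agg(
    ["mean", "std"]
)

print(pitch_share)
print(by_pitch)
```

The same calls should work on the real training frame by swapping `toy` for `pitches_train`.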
Models
For this lab, you will select one model to submit to the autograder. You may use any modeling techniques you’d like, so long as your model meets these requirements:
- Your model must start from the given training data, unmodified.
  - Importantly, the types and shapes of X_train and y_train should not be changed.
  - In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a compatible shape with and the same variable names and types as X_train.
  - In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a compatible shape with and the same variable names and types as X_train.
  - If preprocessing is necessary, it should be included in your model via a pipeline.
- Your model must have a fit method.
- Your model must have a predict method.
- Your model must have a predict_proba method.
- Your model should be created with scikit-learn version 1.6.1 or newer.
- Your model should be serialized with joblib version 1.4.2 or newer.
- Your serialized model must be less than 5MB.
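One way to satisfy these requirements is to wrap preprocessing and a classifier in a single scikit-learn pipeline, then serialize it with joblib and check the file size. The sketch below is an illustration, not a recommended or expected solution: the data is synthetic (only the column names come from the data dictionary), the median imputer assumes numeric values might be missing, logistic regression is an arbitrary classifier that happens to provide predict_proba, and the file name `pitches.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with the same column names as the lab features.
rng = np.random.default_rng(307)
n = 200
X_toy = pd.DataFrame(
    {
        "release_speed": rng.normal(90, 5, n),
        "release_spin_rate": rng.normal(2300, 150, n),
        "pfx_x": rng.normal(0, 0.5, n),
        "pfx_z": rng.normal(0.5, 0.5, n),
        "stand": rng.choice(["L", "R"], n),
    }
)
# Made-up labels: faster pitches are called "Fastball", the rest "Slider".
y_toy = pd.Series(
    np.where(X_toy["release_speed"] > 90, "Fastball", "Slider"), name="pitch_name"
)

# Preprocessing lives inside the model, so raw X can be passed to predict.
numeric = ["release_speed", "release_spin_rate", "pfx_x", "pfx_z"]
numeric_steps = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["stand"]),
    ]
)
mod = Pipeline(
    [("preprocess", preprocess), ("classifier", LogisticRegression(max_iter=1000))]
)
mod.fit(X_toy, y_toy)

# Serialize, check the size limit, and confirm the reloaded model still works.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pitches.joblib")  # hypothetical file name
    joblib.dump(mod, path)
    size_mb = os.path.getsize(path) / 1e6
    reloaded = joblib.load(path)

proba = reloaded.predict_proba(X_toy.head(3))
print(f"serialized size: {size_mb:.3f} MB; proba shape: {proba.shape}")
```

Because the pipeline accepts the unmodified feature frame, the same object can be fit on X_train and called on X_test or X_prod without any extra steps outside the model.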
To obtain the maximum points via the autograder, your model must outperform the following metrics:
Test Accuracy: 0.937
Production Accuracy: 0.937
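Before submitting, you might compare your own test accuracy against the threshold. The sketch below shows the calculation on a handful of made-up labels and predictions; in practice the true labels would be y_test and the predictions would come from your fitted model.

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions, for illustration only.
y_true = ["Fastball", "Slider", "Fastball", "Curveball", "Slider"]
y_pred = ["Fastball", "Slider", "Fastball", "Slider", "Slider"]

# Accuracy is the share of predictions that match the true label.
acc = accuracy_score(y_true, y_pred)

threshold = 0.937  # test accuracy to outperform, from the lab requirements
print(f"accuracy = {acc:.3f}; beats threshold: {acc > threshold}")
```

Production accuracy is computed by the autograder on data you do not have, so a test accuracy that barely clears the threshold leaves little margin.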
Submission
On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.
Footnotes
[1] This book is the book if you’re looking to get into baseball analytics. Hopefully a Python version will be available soon.