Lab 03: Pitches

For Lab 03, you will use baseball pitch data to develop a model that will predict a pitch’s “type” based on its characteristics.

It’s tough to make predictions, especially about the future.

– Yogi Berra

Background

What is a pitch type, you might ask? Well, it’s complicated.

YouTube: How to Identify Baseball Pitches

As we’ll see in a moment, while the pitch type is technically defined by what the pitcher claims they threw, we can probably infer the pitch type based on only speed and spin.

Now that you’re a pitch type expert, here’s a game to see how well you can identify pitches from video:

Baseball Savant: Guess the Pitch Type

That game was difficult, wasn’t it? Don’t feel bad! Identifying pitches by eye is hard, and it gets even harder once you realize the cameras are playing tricks on you:

YouTube: Camera Angles in MLB and How It Affects Us, A Deep-Dive

But wait! Then how do television broadcasts of baseball games instantly display the pitch type? You guessed it… machine learning! For a deep dive on how they do this, see here:

MLB Technology Blog: MLB Pitch Classification

The long story short is:

Have advanced tracking technology that can instantly record speed, spin, and other measurements for each pitch.
Have a trained classifier for pitch type based on speed, spin, and more.
In real time, make a prediction of the pitch type as soon as the speed and spin are recorded.
Display the result in the stadium and on the broadcast!

There are several ways to accomplish this task, but one decision we’ll make is to fit one model per pitcher. For the purpose of this lab, we will only model the pitches of a single pitcher, Shohei Ohtani.

Shohei Ohtani pitched for the Los Angeles Angels in both the 2022 and 2023 seasons.

Baseball Savant: Shohei Ohtani

Ohtani played for the 2024 Los Angeles Dodgers, but due to injury did not pitch. Instead, on September 19, 2024, he became the first player in MLB history to achieve a “50-50” season, that is, 50 home runs and 50 stolen bases. Yes, in addition to being an elite pitcher, he is also setting records as a batter, which is truly unprecedented in baseball.

For the purpose of this lab, however, we will assume that it is still the 2023 season.

Scenario and Goal

Who are you?

You are a data scientist working for Major League Baseball (MLB) as part of the broadcast operations team. The current date is July 11, 2023, the date of the 2023 MLB All-Star Game.

What is your task?

You are tasked with creating (or updating) a model to assist with automatically displaying the pitch type for each pitch in real time, both in the stadium and on the television broadcast. You have access to data on every pitch thrown in MLB to date, including characteristics of the pitch such as its velocity and rotation, as well as the type of pitch thrown. Additionally, tracking technology in each stadium will provide data on (at least) the speed and rotation of each pitch, in real time. You are asked to create a proof-of-concept for a single pitcher, Shohei Ohtani, but your process should be designed so that it can easily be applied to other pitchers as well.

Who are you writing for?

To summarize your work, you will write a report for your manager, who reports to the Vice President of Media Operations. You can assume your manager is very familiar with baseball and associated broadcasting. With a focus on broadcasting, your manager is especially concerned with the ability of your model to work in real time; that is, it must make predictions nearly instantaneously after the pitch is thrown.
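One rough way to get a feel for the real-time concern is to time a single-row prediction. The sketch below is only an illustration under stated assumptions: the four training rows are made-up values, the decision tree is an arbitrary placeholder model, and the column names are borrowed from the data dictionary later in this lab. A single timing is noisy, so in practice you would likely repeat the measurement many times.

```python
import time

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up training rows (illustrative values, not real Statcast data).
toy_X = pd.DataFrame(
    {
        "release_speed": [97.0, 98.1, 85.2, 84.6],
        "release_spin_rate": [2210.0, 2250.0, 2490.0, 2530.0],
    }
)
toy_y = ["4-Seam Fastball", "4-Seam Fastball", "Sweeper", "Sweeper"]

# Placeholder model; any fitted classifier with a predict method works here.
mod = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# A single incoming pitch, shaped like one row of the training features.
new_pitch = pd.DataFrame({"release_speed": [96.5], "release_spin_rate": [2230.0]})

# Time one prediction with a high-resolution clock.
start = time.perf_counter()
pred = mod.predict(new_pitch)
latency_ms = (time.perf_counter() - start) * 1000

print(f"predicted {pred[0]} in {latency_ms:.3f} ms")
```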

Data

To achieve the goal of this lab, we will need historical pitching data. The necessary data is provided in the following files:

pitches-train.parquet
pitches-test.parquet

Source

The original source of the data is Statcast. Specifically, the pybaseball package was used to interface with Statcast Search, which is part of Baseball Savant.

Data Dictionary

Each observation contains information about a single pitch thrown by Shohei Ohtani in either 2022 (train data) or 2023 (test data) during an MLB regular season game.

Here, the train-test split is based on time.

Train: 2022 MLB Season
Test: (First Half of) 2023 MLB Season

Original and (mostly) complete documentation for Statcast data can be found in the Statcast Search CSV Documentation. A more detailed reference can be found in Appendix C of Analyzing Baseball Data with R.¹

Response

pitch_name

[object] the name of the pitch type thrown

Features

release_speed

[float64] pitch velocity (miles per hour) measured shortly after leaving the pitcher’s hand

release_spin_rate

[float64] pitch spin rate (revolutions per minute) measured shortly after leaving the pitcher’s hand

pfx_x

[float64] horizontal movement (feet) of the pitch from the catcher’s perspective

pfx_z

[float64] vertical movement (feet) of the pitch from the catcher’s perspective

stand

[object] side of the plate on which the batter stands, either L (left) or R (right)

Data in Python

To load the data in Python, use:

import pandas as pd

pitches_train = pd.read_parquet(
    "https://cs307.org/lab/data/pitches-train.parquet",
)
pitches_test = pd.read_parquet(
    "https://cs307.org/lab/data/pitches-test.parquet",
)

Prepare Data for Machine Learning

Create the X and y variants of the data for use with sklearn:

# create X and y for train
X_train = pitches_train.drop("pitch_name", axis=1)
y_train = pitches_train["pitch_name"]

# create X and y for test
X_test = pitches_test.drop("pitch_name", axis=1)
y_test = pitches_test["pitch_name"]

You can assume that within the autograder, similar processing is performed on the production data.

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn.
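As a starting point, the sketch below shows a few pandas summaries that are often useful for data like this. It runs on a tiny made-up frame standing in for pitches_train; the values and pitch labels are illustrative only, and the exact statistics requested on PrairieLearn may differ.

```python
import pandas as pd

# Tiny made-up stand-in for pitches_train (illustrative values only).
toy = pd.DataFrame(
    {
        "pitch_name": [
            "4-Seam Fastball", "4-Seam Fastball",
            "Sweeper", "Sweeper",
            "Cutter", "Cutter",
        ],
        "release_speed": [97.2, 98.0, 85.1, 84.3, 90.4, 91.0],
        "release_spin_rate": [2210.0, 2250.0, 2490.0, 2530.0, 2380.0, 2400.0],
        "pfx_x": [-0.2, -0.3, 1.2, 1.3, 0.2, 0.3],
        "pfx_z": [1.2, 1.3, 0.1, 0.2, 0.6, 0.7],
        "stand": ["L", "R", "R", "L", "R", "L"],
    }
)

# Proportion of each pitch type (the response).
pitch_mix = toy["pitch_name"].value_counts(normalize=True)

# Mean and standard deviation of velocity and spin within each pitch type.
by_pitch = toy.groupby("pitch_name")[["release_speed", "release_spin_rate"]].agg(
    ["mean", "std"]
)

# Number of missing values per column, worth checking before modeling.
missing = toy.isna().sum()

print(pitch_mix)
print(by_pitch)
print(missing)
```

With the real data, replacing toy with pitches_train should give the same summaries for the full 2022 season.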

Models

For this lab you will select one model to submit to the autograder. You may use any modeling technique you’d like, so long as your model meets these requirements:

Your model must start from the given training data, unmodified.
- Importantly, the types and shapes of X_train and y_train should not be changed.
- In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with X_train and the same variable names and types.
- In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a shape compatible with X_train and the same variable names and types.
- If preprocessing is necessary, it should be included in your model via a pipeline.
Your model must have a fit method.
Your model must have a predict method.
Your model must have a predict_proba method.
Your model should be created with scikit-learn version 1.6.1 or newer.
Your model should be serialized with joblib version 1.4.2 or newer.
Your serialized model must be less than 5MB.

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.
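One possible way to meet the requirements listed above is to wrap all preprocessing and the classifier in a single scikit-learn pipeline, then serialize it with joblib. The sketch below is a minimal illustration, not a recommended solution: the data is made up, KNeighborsClassifier is an arbitrary placeholder, median imputation is an assumption in case some measurements are missing, and the file name pitches.joblib is hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["release_speed", "release_spin_rate", "pfx_x", "pfx_z"]
categorical_features = ["stand"]

# Made-up stand-ins for X_train and y_train (illustrative values only).
toy_X = pd.DataFrame(
    {
        "release_speed": [97.2, 98.0, 85.1, 84.3, 90.4, 91.0],
        "release_spin_rate": [2210.0, 2250.0, 2490.0, 2530.0, 2380.0, 2400.0],
        "pfx_x": [-0.2, -0.3, 1.2, 1.3, 0.2, 0.3],
        "pfx_z": [1.2, 1.3, 0.1, 0.2, 0.6, 0.7],
        "stand": ["L", "R", "R", "L", "R", "L"],
    }
)
toy_y = pd.Series(
    ["4-Seam Fastball", "4-Seam Fastball", "Sweeper", "Sweeper", "Cutter", "Cutter"],
    name="pitch_name",
)

# Numeric columns: fill missing values (assumption), then standardize.
numeric = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)

# Route each column group to its own preprocessing step.
preprocess = ColumnTransformer(
    [
        ("num", numeric, numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# The full pipeline exposes fit, predict, and predict_proba.
mod = Pipeline(
    [("preprocess", preprocess), ("classify", KNeighborsClassifier(n_neighbors=1))]
)
mod.fit(toy_X, toy_y)

# Serialize to a temporary location and check the size limit.
path = os.path.join(tempfile.mkdtemp(), "pitches.joblib")  # hypothetical name
joblib.dump(mod, path)
size_mb = os.path.getsize(path) / 1e6
print(f"serialized size: {size_mb:.4f} MB")
```

Because the preprocessing lives inside the pipeline, calling mod.predict on a frame with the same columns and types as X_train requires no extra steps outside the model.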

To obtain the maximum points via the autograder, your model must outperform the following metrics:

Test Accuracy: 0.937
Production Accuracy: 0.937
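Test accuracy can be checked locally before submitting; production data is presumably only available inside the autograder. The sketch below uses made-up labels to show the comparison against the threshold above; with a real model you would pass y_test and mod.predict(X_test) instead.

```python
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels (illustrative only).
y_true = ["Sweeper", "Cutter", "4-Seam Fastball", "Sweeper"]
y_pred = ["Sweeper", "Cutter", "4-Seam Fastball", "Cutter"]

# Accuracy is the fraction of predictions that match the true labels.
test_accuracy = accuracy_score(y_true, y_pred)

# Threshold taken from the autograder requirements in this lab.
passes = test_accuracy > 0.937

print(f"test accuracy: {test_accuracy:.3f}, passes: {bool(passes)}")
```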

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy page.

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.

Footnotes

¹ This book is the book if you’re looking to get into baseball analytics. Hopefully a Python version will be available soon.