Pitch Classification

This page documents the Pitch Classification dataset, which is used in Lab 03 of CS 307.

It’s tough to make predictions, especially about the future.

– Yogi Berra

Source

The original source of the Pitch Classification data is Statcast. Specifically, we will use the pybaseball package to interface with Statcast Search, which is part of Baseball Savant.

Pitch Classification

What is a pitch type, you might ask? Well, it’s complicated. Let’s allow someone else to explain:

As we’ll see in a moment, while the pitch type is technically defined by what the pitcher claims they threw, we can usually infer it from speed and spin alone.

Now that you’re a pitch type expert, here’s a game to see how well you can identify pitches from video:

That game was difficult, wasn’t it? Don’t feel bad! Identifying pitches by eye is hard. It is even harder once you realize the cameras are playing tricks on you:

But wait! Then how do television broadcasts of baseball games instantly display the pitch type? You guessed it… machine learning! For a deep dive on how they do this, see here:

Long story short:

  • Have advanced tracking technology that can instantly record speed and spin for each pitch.
  • Have a trained classifier for pitch type based on speed and spin.
  • In real time, make predictions of pitch type as soon as the speed and spin are recorded.
  • Display the result in the stadium and on the broadcast!

There are several ways to accomplish this task, but one decision we’ll make is to fit one model per pitcher.
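The steps above can be sketched in a few lines. This is a minimal illustration, not the method used by any broadcast system: the toy measurements are made up, the `pitcher` column is hypothetical (the lab data covers a single pitcher), and the nearest-neighbor classifier from scikit-learn is just one convenient choice.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy data: values are invented for illustration, not real Statcast measurements.
# The `pitcher` column is hypothetical; the lab data contains one pitcher only.
pitches = pd.DataFrame({
    "pitcher": ["A"] * 4 + ["B"] * 4,
    "release_speed": [98.0, 97.5, 84.0, 85.0, 92.0, 91.5, 78.0, 79.0],
    "release_spin_rate": [2200, 2250, 2600, 2650, 2300, 2350, 2800, 2750],
    "pitch_name": [
        "4-Seam Fastball", "4-Seam Fastball", "Sweeper", "Sweeper",
        "4-Seam Fastball", "4-Seam Fastball", "Curveball", "Curveball",
    ],
})

features = ["release_speed", "release_spin_rate"]

# Fit a separate classifier for each pitcher.
models = {}
for pitcher, group in pitches.groupby("pitcher"):
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(group[features], group["pitch_name"])
    models[pitcher] = model

# "Real time" step: a new pitch arrives with its speed and spin recorded.
new_pitch = pd.DataFrame({"release_speed": [90.0], "release_spin_rate": [2500]})
prediction_a = models["A"].predict(new_pitch)[0]
prediction_b = models["B"].predict(new_pitch)[0]
print(prediction_a, "|", prediction_b)
```

In this toy setup the same speed and spin are labeled differently depending on who threw the pitch, which is the motivation for fitting one model per pitcher. The features are left unscaled here, so spin rate dominates the distance calculation; a real model would likely standardize them first.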

Data Dictionary

The Pitch Classification dataset used in CS 307 has been preprocessed specifically for Lab 03.

We will look at data for a single pitcher, Shohei Ohtani, who pitched for the Los Angeles Angels in both 2022 and 2023. He signed with the Los Angeles Dodgers ahead of the 2024 season.

Note that we are providing only a subset of the available features. Additionally, we have deliberately poisoned the data with a non-trivial number of missing values.
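Because values are missing, it helps to count them before modeling. The sketch below uses a few toy rows with the documented column names; the positions of the missing values are made up, and median and most-frequent filling is only one simple strategy among several.

```python
import numpy as np
import pandas as pd

# Toy rows using the documented column names; the NaN positions are invented.
sample = pd.DataFrame({
    "release_speed": [84.7, np.nan, 84.4, 74.3],
    "release_spin_rate": [2667.0, 2634.0, np.nan, np.nan],
    "stand": ["R", "R", None, "L"],
})

# Count missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)

# One simple strategy: median for numeric columns, most frequent value for categories.
filled = sample.copy()
for col in ["release_speed", "release_spin_rate"]:
    filled[col] = filled[col].fillna(filled[col].median())
filled["stand"] = filled["stand"].fillna(filled["stand"].mode()[0])
print(filled)
```

When evaluating a model, the fill values should be learned from the training data only and then reused on the test data, so that no information leaks from test to train.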

We document this specific data here. Each observation contains information about a single pitch thrown by Shohei Ohtani in either 2022 (train data) or 2023 (test data) during an MLB regular season game.

pitch_name

  • [object] the name of the pitch type thrown, as recorded in the Statcast data

release_speed

  • [float64] pitch velocity (miles per hour) measured shortly after leaving the pitcher’s hand

release_spin_rate

  • [float64] pitch spin rate (revolutions per minute) measured shortly after leaving the pitcher’s hand

pfx_x

  • [float64] horizontal movement (feet) of the pitch from the catcher’s perspective

pfx_z

  • [float64] vertical movement (feet) of the pitch from the catcher’s perspective

stand

  • [object] side of the plate on which the batter stands, either L (left) or R (right)

Original and complete documentation for this data can be found in the Statcast Search CSV Documentation.

Data for Machine Learning

For the CS 307 lab, we provide the training data as a CSV file accessible via the web.

The test data is only available within the autograder.

Here, the train-test split is based on time.

  • Train: 2022 MLB Season
  • Test: 2023 MLB Season
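A time-based split can be sketched as follows. The lab data arrives already split, so this is only an illustration of the idea: the `game_date` column is not among the provided features, and the rows below are made up.

```python
import pandas as pd

# Hypothetical rows: `game_date` is not one of the lab's provided columns,
# and these dates and speeds are invented for illustration.
pitches = pd.DataFrame({
    "game_date": pd.to_datetime(
        ["2022-04-07", "2022-09-29", "2023-03-30", "2023-08-23"]
    ),
    "release_speed": [97.8, 98.4, 97.1, 94.2],
})

# Split by season rather than at random, so the model is always
# evaluated on pitches thrown after the ones it was trained on.
season = pitches["game_date"].dt.year
train = pitches[season == 2022]
test = pitches[season == 2023]
print(len(train), len(test))
```

Splitting on time mimics how the model would be used in practice: it is fit on past pitches and then asked to classify future ones.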

Loading the Data

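import pandas as pd
# Load the 2022 training data directly from the course website
pitches_train = pd.read_csv("https://cs307.org/lab-03/data/pitches-train.csv")
# Display a preview of the data frame
pitches_train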
      pitch_name       release_speed  release_spin_rate  pfx_x  pfx_z  stand
0     Sweeper                   84.7             2667.0   1.25   0.01  R
1     Sweeper                   83.9             2634.0   1.41   0.20  R
2     Sweeper                   84.4             2526.0   1.26   0.25  R
3     Curveball                 74.3             2389.0   0.93  -1.10  L
4     Sweeper                   85.6             2474.0   1.08   0.52  R
...   ...                        ...                ...    ...    ...  ...
2623  Split-Finger              91.8             1314.0  -0.30   0.08  R
2624  Sweeper                   86.9             2440.0   1.11   0.51  R
2625  4-Seam Fastball           99.2             2320.0   0.04   0.81  R
2626  4-Seam Fastball           97.9             2164.0   0.08   1.06  R
2627  4-Seam Fastball           99.8             2182.0  -0.19   1.07  R

2628 rows × 6 columns
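With the data loaded, a natural next step is to separate the features from the target and fit a model that copes with both missing values and the categorical `stand` column. The following is a minimal sketch, not the lab solution: it re-types the ten preview rows shown above so that it runs offline, and the choice of scikit-learn imputers, one-hot encoding, and a decision tree is an assumption made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# The ten rows shown in the preview, typed in by hand so the sketch runs offline.
columns = ["pitch_name", "release_speed", "release_spin_rate", "pfx_x", "pfx_z", "stand"]
rows = [
    ("Sweeper", 84.7, 2667.0, 1.25, 0.01, "R"),
    ("Sweeper", 83.9, 2634.0, 1.41, 0.20, "R"),
    ("Sweeper", 84.4, 2526.0, 1.26, 0.25, "R"),
    ("Curveball", 74.3, 2389.0, 0.93, -1.10, "L"),
    ("Sweeper", 85.6, 2474.0, 1.08, 0.52, "R"),
    ("Split-Finger", 91.8, 1314.0, -0.30, 0.08, "R"),
    ("Sweeper", 86.9, 2440.0, 1.11, 0.51, "R"),
    ("4-Seam Fastball", 99.2, 2320.0, 0.04, 0.81, "R"),
    ("4-Seam Fastball", 97.9, 2164.0, 0.08, 1.06, "R"),
    ("4-Seam Fastball", 99.8, 2182.0, -0.19, 1.07, "R"),
]
preview = pd.DataFrame(rows, columns=columns)

# Separate the features (X) from the target (y).
X = preview.drop(columns="pitch_name")
y = preview["pitch_name"]

numeric = ["release_speed", "release_spin_rate", "pfx_x", "pfx_z"]
categorical = ["stand"]

# Impute missing values, then one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(handle_unknown="ignore"),
    ), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
model.fit(X, y)
train_accuracy = model.score(X, y)
print(train_accuracy)
```

Accuracy on the same ten rows used for fitting says nothing about performance on 2023 pitches; it only confirms that the pipeline runs end to end. Keeping the imputation inside the pipeline means the fill values are learned from whatever data `fit` receives.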