Lab 08: MLB Swing Probability Modeling
Scenario: Suppose you work for a Major League Baseball (MLB) team as a part of their analytics department. For a variety of reasons, it is useful to know if a batter will swing at a particular pitch, even independent of outcome. You are tasked with developing a well calibrated probability model that estimates the probability of inducing a swing given the characteristics of a pitch, in addition to other information such as game situation, for a particular pitcher. (In practice, you would build one model per pitcher.)
Goal
The goal of this lab is to develop a model to estimate the probability that an MLB batter swings at a pitch thrown by a particular pitcher.
Data
This lab will use data from Statcast. We will focus on a model for a specific pitcher, Zac Gallen.
In this lab, the train-test split is done within the 2023 MLB season. That is, the training data is Zac Gallen pitches that occurred between opening day (2023-03-30) and the trade deadline (2023-08-31). The test data covers the remainder of the season, from September 1 (2023-09-01) to the final day of the regular season (2023-10-02). Hidden from you is the production data, which covers the postseason (2023-10-03 to 2023-11-01).
We do this in place of randomly splitting the data in an attempt to create a model that can predict into the future. Imagine this model is created on the final day of the regular season, then possibly used to make baseball decisions during the playoffs.
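Although the lab data is already split for you, the temporal split described above can be sketched with pandas. The game_date column name here is a hypothetical placeholder used only for illustration:

```python
import pandas as pd

# toy frame standing in for a season of pitches; "game_date" is a
# hypothetical column name used only for illustration
pitches = pd.DataFrame({
    "game_date": pd.to_datetime(["2023-04-15", "2023-07-01", "2023-09-10"]),
    "swing": [1, 0, 1],
})

# train on pitches through the trade deadline, test on the rest
train = pitches[pitches["game_date"] <= "2023-08-31"]
test = pitches[pitches["game_date"] >= "2023-09-01"]
```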
Because Statcast data is constantly updated and improved, and can therefore change at any moment, we provide a snapshot of the data for use in this lab.
Each sample is a pitch thrown by Zac Gallen.
Response
swing
[int64]
Whether or not the batter swung.
Features
While we will certainly not be able to make any truly causal claims about our model, it is important to understand which variables are controlled by the pitcher. We could imagine a coach using this model to help explain to a pitcher where and how to throw a pitch if they want to induce a swing.
Fully Pitcher Controlled
This variable is fully controlled by the pitcher.
pitch_name
[object]
The name of the pitch type to be thrown.
Mostly Pitcher Controlled
These variables are largely controlled by the pitcher, but even at the highest levels of baseball, there will be variance based on skill, fatigue, etc.
release_extension
[float64]
Release extension of pitch in feet as tracked by Statcast.
release_pos_x
[float64]
Horizontal release position of the ball measured in feet from the catcher’s perspective.
release_pos_y
[float64]
Release position of pitch measured in feet from the catcher’s perspective.
release_pos_z
[float64]
Vertical release position of the ball measured in feet from the catcher’s perspective.
Somewhat Pitcher Controlled
These variables are in some sense controlled by the pitcher, but less so than the previous variables. At the MLB level, pitchers will have some control here, but even at the highest levels, there can be a lot of variance.
release_speed
[float64]
Velocity of the pitch thrown.
release_spin_rate
[float64]
Spin rate of pitch tracked by Statcast.
spin_axis
[float64]
The spin axis in the 2D X-Z plane in degrees from 0 to 360, such that 180 represents a pure backspin fastball and 0 degrees represents a pure topspin (12-6) curveball.
plate_x
[float64]
Horizontal position of the ball when it crosses home plate from the catcher’s perspective.
plate_z
[float64]
Vertical position of the ball when it crosses home plate from the catcher’s perspective.
Downstream Pitcher Controlled
These variables are pitch characteristics that may be somewhat controlled by the pitcher, but are largely functions of the previous variables.
pfx_x
[float64]
Horizontal movement in feet from the catcher’s perspective.
pfx_z
[float64]
Vertical movement in feet from the catcher’s perspective.
Situational Information
These variables describe part of the game situation when the pitch was thrown. (We have omitted some other obvious variables here like score and inning, just for simplicity.) These are fixed before a pitch is thrown, but could have an effect. Pitchers and batters often act differently based on the game situation. For example, batters are known to “protect” when there are two strikes, thus, much more likely to swing.
balls
[int64]
Pre-pitch number of balls in count.
strikes
[int64]
Pre-pitch number of strikes in count.
on_3b
[int64]
Pre-pitch MLB Player Id of Runner on 3B.
on_2b
[int64]
Pre-pitch MLB Player Id of Runner on 2B.
on_1b
[int64]
Pre-pitch MLB Player Id of Runner on 1B.
outs_when_up
[int64]
Pre-pitch number of outs.
Fixed Batter Information
These variables give some information about the batter facing the pitch. In particular, are they a righty or lefty, and the size of their strike zone, which is a function of their height.
stand
[object]
Side of the plate the batter stands on.
sz_top
[float64]
Top of the batter’s strike zone set by the operator when the ball is halfway to the plate.
sz_bot
[float64]
Bottom of the batter’s strike zone set by the operator when the ball is halfway to the plate.
Data in Python
To load the data in Python, use:
import pandas as pd

pitches_train = pd.read_csv("https://cs307.org/lab-08/data/pitches-train.csv")
pitches_test = pd.read_csv("https://cs307.org/lab-08/data/pitches-test.csv")
Prepare Data for Machine Learning
Because the data is already train-test split, we can simply create the X and y variants of the data.
# create X and y for train data
X_train = pitches_train.drop(columns=["swing"])
y_train = pitches_train["swing"]

# create X and y for test data
X_test = pitches_test.drop(columns=["swing"])
y_test = pitches_test["swing"]
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
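As a starting point, the overall swing rate and the swing rate by pitch type are natural sample statistics. A minimal sketch, using a toy stand-in for the real training data:

```python
import pandas as pd

# toy stand-in for pitches_train, with only the relevant columns
pitches_train = pd.DataFrame({
    "pitch_name": ["4-Seam Fastball", "4-Seam Fastball", "Knuckle Curve", "Changeup"],
    "swing": [1, 0, 1, 1],
})

# overall swing rate
overall_swing_rate = pitches_train["swing"].mean()

# swing rate by pitch type
swing_rate_by_pitch = pitches_train.groupby("pitch_name")["swing"].mean()
```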
Models
For this lab, you will need to train two separate but related models.
Probability Model: Train a supervised model (likely a classifier) to predict swing from the other variables. However, this model will not be evaluated on its ability to classify pitches as swing or no swing. Instead, we will directly assess its ability to estimate the probability of a swing. Thus, you need a well-calibrated model.
The sklearn user guide page on probability calibration will provide some hints. Importantly, you may need to use CalibratedClassifierCV to further calibrate the probability estimates from a classifier.
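A minimal sketch of that approach, using synthetic data and an arbitrary choice of base classifier:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data standing in for the pitch features
X, y = make_classification(n_samples=500, random_state=1)

# wrap a classifier so its probability estimates are recalibrated via
# cross-validation; "isotonic" and "sigmoid" are the available methods
clf = CalibratedClassifierCV(
    RandomForestClassifier(random_state=1), method="isotonic", cv=5
)
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # calibrated estimates of P(class 1)
```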
In the autograder, we will use two metrics to assess your submitted model:
- Expected Calibration Error (ECE): This is essentially how far on average the points on a calibration plot are from the “perfect” line.
- Maximum Calibration Error (MCE): This is essentially the furthest any point on a calibration plot is from the “perfect” line.
We provide Python functions for creating calibration plots and calculating these metrics.
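Use the provided functions for your actual submission; as a rough illustration of the idea, both metrics can be computed from equal-width probability bins (this simplified version is not guaranteed to match the provided implementation exactly):

```python
import numpy as np

def calibration_errors(y_true, y_prob, n_bins=10):
    """Return (ECE, MCE) computed from equal-width probability bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    ece, gaps = 0.0, []
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue  # skip empty bins
        # gap between observed frequency and mean predicted probability
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        gaps.append(gap)
        ece += mask.mean() * gap  # ECE weights each gap by bin size
    return ece, max(gaps)
```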
Novelty Detector: The second model you will train is an unsupervised novelty detector. It should be fit to the training features only. We will test how many observations it flags as novel (outliers) in the test data. Your detector should detect at least one novel observation in the test and production data, but flag no more than 5% of the observations. Use 1 for inliers and -1 for outliers, as is the default in sklearn.
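One possible detector (among several in sklearn) is an IsolationForest fit on the training features; the synthetic data below is only for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(307)
X_train = rng.normal(size=(500, 4))           # stand-in training features
X_test = np.vstack([rng.normal(size=(99, 4)),
                    [[8.0, 8.0, 8.0, 8.0]]])  # one obvious outlier appended

# fit on the training features only; contamination controls how
# aggressively observations are flagged as novel
detector = IsolationForest(contamination=0.01, random_state=307)
detector.fit(X_train)

labels = detector.predict(X_test)      # 1 = inlier, -1 = novel
prop_novel = (labels == -1).mean()     # should stay at or below 0.05
```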
For this lab, you may train models however you’d like! The only rules are:
Probability Model:
- Models must start from the given training data, unmodified.
  - Importantly, the types and shapes of X_train and y_train should not be changed.
- In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a compatible shape with, and the same variable names and types as, X_train.
- We assume that you will use a Pipeline and GridSearchCV from sklearn, as you will need to deal with heterogeneous data, and you should be using cross-validation.
  - So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
- Your model must have a fit method.
- Your model must have a predict method that returns numbers.
- Your model must have a predict_proba method that returns numbers.
- Your model should be created with scikit-learn version 1.4.0 or newer.
- Your model should be serialized with joblib version 1.3.2 or newer.
- Your serialized model must be less than 5MB.
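The Pipeline-plus-GridSearchCV pattern described above might look like the following sketch, where the toy data, the chosen transformers, and the LogisticRegression estimator are all illustrative placeholders:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy stand-in for the lab data, with only a few columns for brevity
X_train = pd.DataFrame({
    "pitch_name": ["4-Seam Fastball", "Changeup"] * 25,
    "release_speed": [94.5, 85.0] * 25,
    "strikes": [2, 0] * 25,
})
y_train = pd.Series([1, 0] * 25, name="swing")

# one-hot encode categorical columns, scale numeric ones
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["pitch_name"]),
    ("num", StandardScaler(), ["release_speed", "strikes"]),
])

pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

# cross-validated search over a tiny, purely illustrative grid
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)  # the fitted grid is the "model" you would save
```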
Novelty Detector:
- Models must start from the given training data, unmodified.
  - Importantly, the types and shapes of X_train and y_train should not be changed.
- In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a compatible shape with, and the same variable names and types as, X_train.
- Your model should be created with scikit-learn version 1.4.0 or newer.
- Your model should be serialized with joblib version 1.3.2 or newer.
- Your serialized model must be less than 5MB.
To obtain the maximum points via the autograder, your model performance must meet or exceed:
Test ECE: 0.075
Test MCE: 0.135
Test Proportion Novel: 0.05
Production ECE: 0.075
Production MCE: 0.135
Production Proportion Novel: 0.05
Model Persistence
To save your models for submission to the autograder, use the dump
function from the joblib
library. Check PrairieLearn for the filename that the autograder expects.
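A minimal sketch of saving and size-checking a model; the filename model.joblib is a placeholder, so use the one PrairieLearn specifies:

```python
import os
import joblib
from sklearn.linear_model import LogisticRegression

# a tiny fitted model standing in for your real pipeline
mod = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

# "model.joblib" is a placeholder filename; check PrairieLearn for the
# one the autograder expects; compression helps stay under the 5MB limit
joblib.dump(mod, "model.joblib", compress=3)

size_mb = os.path.getsize("model.joblib") / 1e6
```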
Discussion
As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab! If you are asked to train multiple models, first make clear which model you selected and are considering for use in practice. Discuss any limitations or potential improvements.
Additional discussion topics:
- If you tried to use this model as an MLB coach, which variables would you ask the pitcher to modify to induce a swing? Why?
When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!
Template Notebook
Submission
On Canvas, be sure to submit both your source .ipynb
file and a rendered .html
version of the report.