Lab 09: MLB Swing Probability Modeling
For Lab 09, you will use baseball pitch data to develop a model that will estimate the probability that a batter swings.
Swing at the strikes.
Background
While the game of baseball is a competition between two teams, it ultimately reduces to a struggle between a specific batter and pitcher.
- The job of a pitcher is to prevent batters from reaching base. They can do this by striking them out, inducing a ground out, or getting the batter to fly out.
- The objective of the batter is to reach base, either via a walk or a hit that results from making solid contact.
These conflicting goals are sought via a psychological struggle, with the two players trying to outthink each other. The batter tries to anticipate the type and location of a pitch, while the pitcher tries to deceive the batter.
Depending on the game situation and the pitcher's own characteristics, a pitcher often throws a pitch with the specific intention of either making the batter swing or getting the batter to take the pitch.
- Pitchers want batters to swing at pitches that have a low probability of success, that is, either a swing-and-miss (strike) or weak contact that results in a field out. They do this by throwing pitches to locations that are hard to hit, while trying to make them appear as if they will be easy to hit.
- Pitchers want batters to take (not swing at) pitches that will be called for a strike. Pitchers do this by throwing pitches that look like balls as they approach the plate, but just barely cross the strike zone.
Modern baseball analytics is interested in studying when batters swing at pitches, both to assist in pitcher development, and as part of larger data analytics systems.
Scenario and Goal
Who are you?
- You are a data scientist working for a Major League Baseball (MLB) team as a part of their Research & Development department.
What is your task?
- Your goal is to develop a well-calibrated probability model that, for a particular pitcher, estimates the probability of inducing a batter to swing given the characteristics of the pitch thrown, along with other information such as the game situation.
Who are you writing for?
- To summarize your work, you will write a report for the VP of Research & Development. You can assume the VP is a baseball expert, and reasonably familiar with the general concepts of data analysis and machine learning.
Data
To achieve the goal of this lab, we will need data on previously thrown pitches. The necessary data is provided in the train and test files described and loaded below.
Source
This lab will use data from Statcast. We will focus on a model for a specific pitcher, Zac Gallen.
In this lab, the train-test split is done within the 2023 MLB Season. That is, the training data is Zac Gallen pitches that occurred between opening day (2023-03-30) and the end of August (2023-08-31). The test data covers the remainder of the regular season, from September 1 (2023-09-01) to the final day of the regular season (2023-10-02). Hidden from you is the production data, which covers the postseason (2023-10-03 to 2023-11-01).
We do this in place of randomly splitting the data in an attempt to create a model that can predict into the future. Imagine this model is created on the final day of the regular season, then possibly used to make baseball decisions during the playoffs.
Because Statcast data can change at any moment (it is constantly updated and improved), we provide a snapshot of the data for use in this lab. If you are interested in obtaining similar data for other analyses, we recommend the pybaseball package.
Data Dictionary
Each observation in the train, test, and (hidden) production data contains information about a pitch thrown by Zac Gallen.
Response
swing
[int64]
Whether or not the batter swung (1) or took (0).
Features
While we will certainly not be able to make any truly causal claims about our model, it is important to understand which variables are controlled by the pitcher. We could imagine a coach using this model to help explain to a pitcher where and how to throw a pitch if they want to induce a swing. As such, we will group the feature variables based on the degree of control the pitcher asserts over them.
Fully Pitcher Controlled
This variable is fully controlled by the pitcher. In modern baseball, this information is communicated between the pitcher and catcher before the pitch via PitchCom.
pitch_name
[object]
The name of the pitch type to be thrown.
Mostly Pitcher Controlled
These variables are largely controlled by the pitcher, but even at the highest levels of baseball, there will be variance based on skill, fatigue, etc. These variables essentially measure where the pitcher's arm is located as a pitch is thrown.
release_extension
[float64]
Release extension of pitch in feet as tracked by Statcast.
release_pos_x
[float64]
Horizontal Release Position of the ball measured in feet from the catcher’s perspective.
release_pos_y
[float64]
Release position of pitch measured in feet from the catcher’s perspective.
release_pos_z
[float64]
Vertical Release Position of the ball measured in feet from the catcher’s perspective.
Somewhat Pitcher Controlled
These variables are in some sense controlled by the pitcher, but less so than the previous variables. At the MLB level, pitchers will have some control here, but even at the highest levels, there can be a lot of variance. The speed and spin features are highly dependent on the pitch type thrown.
release_speed
[float64]
Velocity of the pitch thrown, in miles per hour (MPH).
release_spin_rate
[float64]
Spin rate of the pitch, in revolutions per minute (RPM), as tracked by Statcast.
spin_axis
[float64]
The spin axis in the 2D X-Z plane in degrees from 0 to 360, such that 180 represents a pure backspin fastball and 0 degrees represents a pure topspin (12-6) curveball.
plate_x
[float64]
Horizontal position of the ball when it crosses home plate from the catcher’s perspective.
plate_z
[float64]
Vertical position of the ball when it crosses home plate from the catcher’s perspective.
Downstream Pitcher Controlled
These variables are pitch characteristics that may be somewhat controlled by the pitcher, but are largely functions of the previous variables.
pfx_x
[float64]
Horizontal movement in feet from the catcher’s perspective.
pfx_z
[float64]
Vertical movement in feet from the catcher’s perspective.
Situational Information
These variables describe part of the game situation when the pitch was thrown. (We have omitted some other obvious variables here like score and inning for simplicity.) These are fixed before a pitch is thrown, but could have an effect. Pitchers and batters often act differently based on the game situation. For example, batters are known to “protect” when there are two strikes, thus, are much more likely to swing.
balls
[int64]
Pre-pitch number of balls in count.
strikes
[int64]
Pre-pitch number of strikes in count.
on_3b
[int64]
Pre-pitch MLB Player Id of Runner on 3B.
on_2b
[int64]
Pre-pitch MLB Player Id of Runner on 2B.
on_1b
[int64]
Pre-pitch MLB Player Id of Runner on 1B.
outs_when_up
[int64]
Pre-pitch number of outs.
Fixed Batter Information
These variables give some information about the batter facing the pitcher: in particular, whether they bat right- or left-handed, and the size of their strike zone, which is a function of their height.
stand
[object]
Side of the plate the batter is standing on.
sz_top
[float64]
Top of the batter’s strike zone set by the operator when the ball is halfway to the plate.
sz_bot
[float64]
Bottom of the batter’s strike zone set by the operator when the ball is halfway to the plate.
Data in Python
To load the data in Python, use:
import pandas as pd

pitches_train = pd.read_csv(
    "https://cs307.org/lab-09/data/pitches-train.csv",
)
pitches_test = pd.read_csv(
    "https://cs307.org/lab-09/data/pitches-test.csv",
)
Prepare Data for Machine Learning
Because the data is already train-test split, we can simply create the X and y variants of the data.
# create X and y for train data
X_train = pitches_train.drop(columns=["swing"])
y_train = pitches_train["swing"]

# create X and y for test data
X_test = pitches_test.drop(columns=["swing"])
y_test = pitches_test["swing"]
You can assume that within the autograder, similar processing is performed on the production data.
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
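The specific statistics and visualization you need are listed on PrairieLearn, so the following is only a hypothetical starting point for exploration. It assumes the pitches_train data frame loaded above and uses matplotlib; the particular summaries shown are illustrative choices, not requirements.

import matplotlib.pyplot as plt

# overall swing rate and swing rate by pitch type (illustrative summaries only)
print(pitches_train["swing"].mean())
print(pitches_train.groupby("pitch_name")["swing"].mean())

# pitch location at the plate, colored by whether the batter swung
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(
    pitches_train["plate_x"],
    pitches_train["plate_z"],
    c=pitches_train["swing"],
    cmap="coolwarm",
    alpha=0.4,
    s=10,
)
ax.set_xlabel("plate_x (feet)")
ax.set_ylabel("plate_z (feet)")
ax.set_title("Pitch location and swings (training data)")
plt.show()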
Models
For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:
- Models must start from the given training data, unmodified.
  - Importantly, the types and shapes of X_train and y_train should not be changed.
  - In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a compatible shape with, and the same variable names and types as, X_train.
  - In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a compatible shape with, and the same variable names and types as, X_train.
- We assume that you will use a Pipeline and GridSearchCV from sklearn, as you will need to deal with heterogeneous data and you should be using cross-validation.
  - More specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a "model" that you can submit to the autograder. (See the sketch after this list.)
- Your model must have a fit method.
- Your model must have a predict method.
- Your model must have a predict_proba method.
- Your model should be created with scikit-learn version 1.5.2 or newer.
- Your model should be serialized with joblib version 1.4.2 or newer.
  - Your serialized model must be less than 5MB.
To obtain the maximum points via the autograder, your model's calibration errors must be at or below the following thresholds:
- Test ECE: 0.075
- Test MCE: 0.135
- Production ECE: 0.075
- Production MCE: 0.145
Probability Calibration
What are the metrics ECE and MCE? How are we evaluating models here?
Models will not be evaluated on their ability to classify pitches as swing or no swing. Instead, we will directly assess their ability to estimate the probability of a swing. Thus, you need a well-calibrated model.
The sklearn user guide page on probability calibration will provide some hints. Importantly, you may need to use CalibratedClassifierCV to further calibrate the probability estimates from a classifier.
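As one hedged possibility (not the only or required approach), you might refit the best pipeline found by the grid search sketched earlier inside CalibratedClassifierCV; the method and cv values here are placeholder choices.

from sklearn.calibration import CalibratedClassifierCV

# clone and refit the best pipeline with cross-validated isotonic calibration
calibrated_mod = CalibratedClassifierCV(mod.best_estimator_, method="isotonic", cv=5)
calibrated_mod.fit(X_train, y_train)

# estimated swing probabilities on the test set
p_hat_test = calibrated_mod.predict_proba(X_test)[:, 1]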
In the autograder, we will use two metrics to assess your submitted model:
- Expected Calibration Error (ECE): This is essentially an average of the distance the points on a calibration plot are from the “perfect” line.
- Maximum Calibration Error (MCE): This is essentially the furthest any point on a calibration plot is from the “perfect” line.
We provide Python functions for creating calibration plots and calculating these metrics.
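The provided functions may bin and weight differently, but the following sketch, which assumes equal-width probability bins and bin-count weighting for ECE, illustrates the underlying idea; calibration_curve from sklearn gives the points for a calibration plot.

import numpy as np
from sklearn.calibration import calibration_curve

def ece_mce(y_true, y_prob, n_bins=10):
    # sketch only: equal-width bins; the course-provided functions may differ
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not np.any(mask):
            continue
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += (mask.sum() / len(y_true)) * gap
        mce = max(mce, gap)
    return ece, mce

# points for a calibration plot, using the test-set probabilities from the earlier sketch
prob_true, prob_pred = calibration_curve(y_test, p_hat_test, n_bins=10)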
Model Persistence
To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects for this lab.
from joblib import dump

dump(mod, "filename.joblib")
Discussion
As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab! Justify your conclusion. If you trained multiple models that are mentioned in your report, first make clear which model you selected and are considering for use in practice.
Additional discussion topics:
- If you tried to use this model as an MLB coach, which variables would you ask the pitcher to modify to induce a swing? Why?
When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!
Template Notebook
Submission
On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.