Lab 09: MLB Swing Probability Modeling

For Lab 09, you will use baseball pitch data to develop a model that will estimate the probability that a batter swings.

Swing at the strikes.

– Yogi Berra

Background

While the game of baseball is a competition between two teams, it ultimately reduces to a struggle between a specific batter and pitcher.

The job of a pitcher is to prevent batters from reaching base. They can do this by striking them out, inducing a ground out, or getting the batter to fly out.
The objective of the batter is to reach base, either via a walk or a hit that results from making solid contact.

These conflicting goals are sought via a psychological struggle, with the two players trying to outthink each other. The batter tries to anticipate the type and location of a pitch, while the pitcher tries to deceive the batter.

Depending on the game situation and the pitcher's own characteristics, a pitcher often throws a pitch intending either to make the batter swing or to make the batter take.

Pitchers want batters to swing at pitches that have a low probability of success, that is, pitches that produce either a swing-and-miss (strike) or weak contact that results in a field out. They do this by throwing pitches to locations that are hard to hit while trying to make them appear easy to hit.
Pitchers want batters to take (not swing at) pitches that will be called strikes. Pitchers do this by throwing pitches that look like balls as they approach the plate, but just barely cross the strike zone.

Modern baseball analytics is interested in studying when batters swing at pitches, both to assist in pitcher development, and as part of larger data analytics systems.

Scenario and Goal

Who are you?

You are a data scientist working for a Major League Baseball (MLB) team as a part of their Research & Development department.

What is your task?

Your goal is to develop a well-calibrated probability model for a particular pitcher. Given the characteristics of a pitch thrown, along with other information such as the game situation, the model should estimate the probability that the batter swings.

Who are you writing for?

To summarize your work, you will write a report for the VP of Research & Development. You can assume the VP is a baseball expert, and reasonably familiar with the general concepts of data analysis and machine learning.

Data

To achieve the goal of this lab, we will need data on previously thrown pitches. The necessary data is provided as a train file and a test file, both CSV, which are loaded in the Data in Python section.

Source

This lab will use data from Statcast. We will focus on a model for a specific pitcher, Zac Gallen.

In this lab, the train-test split is done within the 2023 MLB Season. That is, the training data is Zac Gallen pitches thrown between opening day (2023-03-30) and the end of August (2023-08-31). The test data covers the remainder of the season, from September 1 (2023-09-01) to the final day of the regular season (2023-10-02). Hidden from you is the production data, which covers the postseason (2023-10-03 to 2023-11-01).

We do this in place of randomly splitting the data in an attempt to create a model that can predict into the future. Imagine this model is created on the final day of the regular season, then possibly used to make baseball decisions during the playoffs.
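The date-based split described above can be sketched as follows. The provided files are already split, so this is for illustration only; the `game_date` column name, the helper `split_by_date`, and the toy rows are assumptions and not part of the lab data dictionary.

```python
import pandas as pd


def split_by_date(df, date_col, train_end, test_end):
    """Split rows chronologically: train <= train_end < test <= test_end."""
    dates = pd.to_datetime(df[date_col])
    train_end = pd.Timestamp(train_end)
    test_end = pd.Timestamp(test_end)
    train = df[dates <= train_end]
    test = df[(dates > train_end) & (dates <= test_end)]
    return train, test


# Toy rows (hypothetical), one per boundary date mentioned above.
toy = pd.DataFrame({
    "game_date": ["2023-03-30", "2023-08-31", "2023-09-01", "2023-10-02", "2023-10-03"],
    "swing": [1, 0, 1, 0, 1],
})
toy_train, toy_test = split_by_date(toy, "game_date", "2023-08-31", "2023-10-02")
```

Splitting on time rather than at random keeps every test pitch later than every training pitch, which mirrors how the model would be used during the playoffs.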

Because Statcast data is constantly changed and improved, we provide a snapshot of the data for use in this lab. If you are interested in obtaining similar data for other analyses, we recommend the pybaseball package.

Data Dictionary

Each observation in the train, test, and (hidden) production data contains information about a pitch thrown by Zac Gallen.

Response

swing

[int64] Whether or not the batter swung (1) or took (0).

Features

While we will certainly not be able to make any truly causal claims about our model, it is important to understand which variables are controlled by the pitcher. We could imagine a coach using this model to help explain to a pitcher where and how to throw a pitch if they want to induce a swing. As such, we will group the feature variables based on the degree of control the pitcher exerts over them.

Fully Pitcher Controlled

This variable is fully controlled by the pitcher. In modern baseball, this information is communicated between the pitcher and catcher before the pitch via PitchCom.

pitch_name

[object] The name of the pitch type to be thrown.

Mostly Pitcher Controlled

These variables are largely controlled by the pitcher, but even at the highest levels of baseball, there will be variance based on skill, fatigue, etc. These variables essentially measure where the pitcher’s arm is located as a pitch is thrown.

release_extension

[float64] Release extension of pitch in feet as tracked by Statcast.

release_pos_x

[float64] Horizontal Release Position of the ball measured in feet from the catcher’s perspective.

release_pos_y

[float64] Release position of pitch measured in feet from the catcher’s perspective.

release_pos_z

[float64] Vertical Release Position of the ball measured in feet from the catcher’s perspective.

Somewhat Pitcher Controlled

These variables are in some sense controlled by the pitcher, but less so than the previous variables. At the MLB level, pitchers will have some control here, but even at the highest levels, there can be a lot of variance. The speed and spin features are highly dependent on the pitch type thrown.

release_speed

[float64] Velocity of the pitch thrown.

release_spin_rate

[float64] Spin rate of pitch tracked by Statcast.

spin_axis

[float64] The spin axis in the 2D X-Z plane in degrees from 0 to 360, such that 180 represents a pure backspin fastball and 0 degrees represents a pure topspin (12-6) curveball.

plate_x

[float64] Horizontal position of the ball when it crosses home plate from the catcher’s perspective.

plate_z

[float64] Vertical position of the ball when it crosses home plate from the catcher’s perspective.

Downstream Pitcher Controlled

These variables are pitch characteristics. They may be somewhat controlled by the pitcher, but they are largely functions of the previous variables.

pfx_x

[float64] Horizontal movement in feet from the catcher’s perspective.

pfx_z

[float64] Vertical movement in feet from the catcher’s perspective.

Situational Information

These variables describe part of the game situation when the pitch was thrown. (We have omitted some other obvious variables here like score and inning for simplicity.) These are fixed before a pitch is thrown, but could have an effect. Pitchers and batters often act differently based on the game situation. For example, batters are known to “protect” when there are two strikes, and are thus much more likely to swing.

balls

[int64] Pre-pitch number of balls in count.

strikes

[int64] Pre-pitch number of strikes in count.

on_3b

[int64] Pre-pitch MLB Player Id of Runner on 3B.

on_2b

[int64] Pre-pitch MLB Player Id of Runner on 2B.

on_1b

[int64] Pre-pitch MLB Player Id of Runner on 1B.

outs_when_up

[int64] Pre-pitch number of outs.

Fixed Batter Information

These variables give some information about the batter facing the pitcher: whether they bat right-handed or left-handed, and the size of their strike zone, which is a function of their height.

stand

[object] Side of the plate on which the batter is standing.

sz_top

[float64] Top of the batter’s strike zone set by the operator when the ball is halfway to the plate.

sz_bot

[float64] Bottom of the batter’s strike zone set by the operator when the ball is halfway to the plate.
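As one illustration of how the location and strike zone variables relate, the sketch below flags whether the center of a pitch crosses the plate inside a batter's zone. It assumes `plate_x`, `plate_z`, `sz_top`, and `sz_bot` are all measured in feet and that `plate_x` is centered on the middle of the plate; it also ignores the radius of the ball. The helper name `in_zone` and the toy rows are hypothetical.

```python
import pandas as pd

# Home plate is 17 inches wide; convert half of that width to feet.
PLATE_HALF_WIDTH_FT = 17 / 12 / 2


def in_zone(df):
    """Return 1 where the pitch center is inside the strike zone, else 0."""
    horizontal = df["plate_x"].abs() <= PLATE_HALF_WIDTH_FT
    vertical = (df["plate_z"] >= df["sz_bot"]) & (df["plate_z"] <= df["sz_top"])
    return (horizontal & vertical).astype(int)


# Toy pitches: middle of the zone, far outside, and too high.
toy = pd.DataFrame({
    "plate_x": [0.0, 1.0, 0.5],
    "plate_z": [2.5, 2.5, 4.0],
    "sz_top": [3.5, 3.5, 3.5],
    "sz_bot": [1.5, 1.5, 1.5],
})
toy["in_zone"] = in_zone(toy)
```

A derived column like this can be handy for exploratory plots. If you want to use it in a model, compute it inside the pipeline rather than by editing X_train.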

Data in Python

To load the data in Python, use:

import pandas as pd

pitches_train = pd.read_csv(
    "https://cs307.org/lab-09/data/pitches-train.csv",
)
pitches_test = pd.read_csv(
    "https://cs307.org/lab-09/data/pitches-test.csv",
)

Prepare Data for Machine Learning

Because the data is already train-test split, we can simply create the X and y variants of the data.

# create X and y for train data
X_train = pitches_train.drop(columns=["swing"])
y_train = pitches_train["swing"]

# create X and y for test data
X_test = pitches_test.drop(columns=["swing"])
y_test = pitches_test["swing"]

You can assume that within the autograder, similar processing is performed on the production data.

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
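A minimal sketch of one kind of summary you might compute is the swing rate within groups, such as pitch type or strike count. The helper `swing_rate_by` and the toy rows are hypothetical; the statistics actually requested are listed on PrairieLearn.

```python
import pandas as pd


def swing_rate_by(df, group_col, response="swing"):
    """Return the pitch count and swing rate for each level of group_col."""
    return (
        df.groupby(group_col)[response]
        .agg(n_pitches="count", swing_rate="mean")
        .sort_values("swing_rate", ascending=False)
    )


# Toy stand-in for pitches_train (NOT the lab data).
toy = pd.DataFrame({
    "pitch_name": ["4-Seam Fastball", "4-Seam Fastball", "Curveball", "Curveball", "Changeup"],
    "strikes": [0, 2, 1, 2, 2],
    "swing": [0, 1, 0, 1, 1],
})
by_pitch = swing_rate_by(toy, "pitch_name")
by_strikes = swing_rate_by(toy, "strikes")
```

With the real data, you would call something like `swing_rate_by(pitches_train, "strikes")` to see how the swing rate changes with the count.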

Models

For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:

- Models must start from the given training data, unmodified.
  - Importantly, the types and shapes of X_train and y_train should not be changed.
  - In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with X_train and the same variable names and types.
  - In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a shape compatible with X_train and the same variable names and types.
  - We assume that you will use a Pipeline and GridSearchCV from sklearn, as you will need to deal with heterogeneous data, and you should be using cross-validation.
    - More specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
- Your model must have a fit method.
- Your model must have a predict method.
- Your model must have a predict_proba method.
- Your model should be created with scikit-learn version 1.5.2 or newer.
- Your model should be serialized with joblib version 1.4.2 or newer.
  - Your serialized model must be less than 5MB.
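The Pipeline-plus-GridSearchCV structure described in the rules above can be sketched as follows. This is one possible arrangement, not a required or reference solution: the synthetic data, the choice of logistic regression, the parameter grid, and the use of log loss for tuning are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for X_train / y_train (NOT the lab data).
rng = np.random.default_rng(307)
n = 200
X_toy = pd.DataFrame({
    "pitch_name": rng.choice(["4-Seam Fastball", "Curveball", "Changeup"], size=n),
    "stand": rng.choice(["L", "R"], size=n),
    "release_speed": rng.normal(90, 5, size=n),
    "plate_x": rng.normal(0, 1, size=n),
    "plate_z": rng.normal(2.5, 1, size=n),
})
y_toy = ((X_toy["plate_x"].abs() + rng.normal(0, 0.5, size=n)) < 1.0).astype(int)

# Numeric columns: impute then scale. Non-numeric columns: one-hot encode.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, make_column_selector(dtype_include=np.number)),
    ("cat", OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_exclude=np.number)),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune with log loss, which scores probabilities rather than hard labels.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
mod = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_log_loss")
mod.fit(X_toy, y_toy)

probs = mod.predict_proba(X_toy)[:, 1]  # estimated probability of a swing
```

Because all preprocessing lives inside the pipeline, the fitted `mod` accepts a data frame with the same columns as the training data, and it exposes fit, predict, and predict_proba.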

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.

To obtain the maximum points via the autograder, your model's calibration errors must be at or below the following thresholds (lower is better):

Test ECE: 0.075
Test MCE: 0.135
Production ECE: 0.075
Production MCE: 0.145

The production data is data that will mimic data that is passed through your model after you have put it into production, that is, it is being used for the stated goal within the scenario of the lab. As such, you do not have access to it. You do however have access to the test data.

Probability Calibration

What are the metrics ECE and MCE? How are we evaluating models here?

Models will not be evaluated on their ability to classify a pitch as a swing or a take. Instead, we will directly assess their ability to estimate the probability of a swing. Thus, you need a well-calibrated model.

sklearn: Probability Calibration

The above sklearn user guide page will provide some hints. Importantly, you may need to use CalibratedClassifierCV to further calibrate the probability estimates from a classifier.
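A minimal sketch of wrapping a classifier in CalibratedClassifierCV is shown below. The synthetic data, the random forest base model, and the choice of `method="sigmoid"` are assumptions for illustration; whether calibration helps, and which method works best, is something to check on your own models.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (NOT the lab data): the true swing probability
# depends on the first feature only.
rng = np.random.default_rng(9)
X_toy = rng.normal(size=(300, 3))
p_true = 1 / (1 + np.exp(-X_toy[:, 0]))
y_toy = (rng.random(300) < p_true).astype(int)

# The base classifier is fit on cross-validation folds, and a calibrator
# is learned on the held-out part of each fold.
base = RandomForestClassifier(n_estimators=25, random_state=0)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_toy, y_toy)

probs = calibrated.predict_proba(X_toy)[:, 1]  # calibrated swing probabilities
```

The other built-in option is `method="isotonic"`, which is more flexible but generally needs more data. A calibrated classifier can also serve as the final step of a Pipeline, so it can be tuned with GridSearchCV like any other estimator.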

In the autograder, we will use two metrics to assess your submitted model:

Expected Calibration Error (ECE): This is essentially an average of the distance the points on a calibration plot are from the “perfect” line.
Maximum Calibration Error (MCE): This is essentially the furthest any point on a calibration plot is from the “perfect” line.
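To make the two definitions concrete, here is a sketch that computes both metrics with equal-width probability bins. The function name `calibration_errors`, the default of 10 bins, and the binning strategy are assumptions; the provided calibration.py is what the autograder's numbers are based on, and it may differ in these details.

```python
import numpy as np


def calibration_errors(y_true, y_prob, n_bins=10):
    """Return (ECE, MCE) computed over equal-width probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use only the interior edges so that bin indices run from 0 to
    # n_bins - 1 and a probability of exactly 1.0 lands in the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue  # empty bins contribute nothing
        # Gap between the observed swing rate and the mean predicted probability.
        gap = abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
        ece += in_bin.mean() * gap  # weight by the share of pitches in the bin
        mce = max(mce, gap)
    return ece, mce


# Tiny worked example: three pitches predicted at 0.25 (one swing),
# one pitch predicted at 0.75 (a swing).
ece, mce = calibration_errors([0, 0, 1, 1], [0.25, 0.25, 0.25, 0.75])
```

In this example the 0.25 bin has an observed swing rate of 1/3 (gap of about 0.083, weight 0.75) and the 0.75 bin has an observed rate of 1 (gap 0.25, weight 0.25), so ECE is 0.125 and MCE is 0.25.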

We provide Python functions for creating calibration plots and calculating these metrics.

calibration.py

For further reference:

Classifier calibration: a survey on how to assess and improve predicted class probabilities

Model Persistence

To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects for this lab.

from joblib import dump
dump(mod, "filename.joblib")

Because the models for this lab could be quite large, consider using the compress parameter to the dump function!
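A small sketch of using the compress parameter and checking the resulting file size is shown below. The helper `dump_and_report`, the compression level of 3, the toy object, and the file name are all hypothetical; use the filename given on PrairieLearn for your actual submission.

```python
import os
import tempfile

from joblib import dump, load


def dump_and_report(model, path, compress=3):
    """Serialize model with joblib compression and return the file size in MB."""
    dump(model, path, compress=compress)
    return os.path.getsize(path) / (1024 * 1024)


# Stand-in object; in the lab this would be your fitted model.
toy_model = {"weights": list(range(100_000))}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy-model.joblib")
    size_mb = dump_and_report(toy_model, path)
    restored = load(path)  # confirm the file round-trips
```

Higher compression levels (up to 9) trade slower saving and loading for smaller files. If the file is still too large, a simpler model or a smaller ensemble is usually the more effective fix.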

Discussion

As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab! Justify your conclusion. If you trained multiple models that are mentioned in your report, first make clear which model you selected and are considering for use in practice.

Additional discussion topics:

If you tried to use this model as an MLB coach, which variables would you ask the pitcher to modify to induce a swing? Why?

When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!

Template Notebook

Lab 09: Template Notebook

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy page.

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.