Lab 08: MLB Swing Probability Modeling

Scenario: Suppose you work for a Major League Baseball (MLB) team as part of its analytics department. For a variety of reasons, it is useful to know whether a batter will swing at a particular pitch, even independent of the outcome. You are tasked with developing a well-calibrated probability model that estimates the probability of inducing a swing given the characteristics of a pitch, in addition to other information such as the game situation, for a particular pitcher. (In practice, you would build one model per pitcher.)

Goal

The goal of this lab is to develop a model to estimate the probability that an MLB batter swings at a pitch thrown by a particular pitcher.

Data

This lab will use data from Statcast. We will focus on a model for a specific pitcher, Zac Gallen.

In this lab, the train-test split is done within the 2023 MLB Season. That is, the training data consists of Zac Gallen pitches that occurred between opening day (2023-03-30) and the trade deadline (2023-08-31). The test data covers the remainder of the season, from September 1 (2023-09-01) to the final day of the regular season (2023-10-02). Hidden from you is the production data, which covers the postseason (2023-10-03 to 2023-11-01).

We do this in place of randomly splitting the data in an attempt to create a model that can predict into the future. Imagine this model is created on the final day of the regular season, then possibly used to make baseball decisions during the playoffs.

Because Statcast data is constantly updated and improved, and thus can change at any moment, we provide a snapshot of the data for use in this lab.

Each sample is a pitch thrown by Zac Gallen.

Response

swing

  • [int64] Whether or not the batter swung.

Features

While we will certainly not be able to make any truly causal claims about our model, it is important to understand which variables are controlled by the pitcher. We could imagine a coach using this model to help explain to a pitcher where and how to throw a pitch if they want to induce a swing.

Fully Pitcher Controlled

This variable is fully controlled by the pitcher.

pitch_name

  • [object] The name of the pitch type to be thrown.

Mostly Pitcher Controlled

These variables are largely controlled by the pitcher, but even at the highest levels of baseball, there will be variance based on skill, fatigue, etc.

release_extension

  • [float64] Release extension of pitch in feet as tracked by Statcast.

release_pos_x

  • [float64] Horizontal Release Position of the ball measured in feet from the catcher’s perspective.

release_pos_y

  • [float64] Release position of pitch measured in feet from the catcher’s perspective.

release_pos_z

  • [float64] Vertical Release Position of the ball measured in feet from the catcher’s perspective.

Somewhat Pitcher Controlled

These variables are in some sense controlled by the pitcher, but less so than the previous variables. At the MLB level, pitchers will have some control here, but even at the highest levels, there can be a lot of variance.

release_speed

  • [float64] Velocity of the pitch thrown.

release_spin_rate

  • [float64] Spin rate of pitch tracked by Statcast.

spin_axis

  • [float64] The spin axis in the 2D X-Z plane in degrees from 0 to 360, such that 180 represents a pure backspin fastball and 0 degrees represents a pure topspin (12-6) curveball.

plate_x

  • [float64] Horizontal position of the ball when it crosses home plate from the catcher’s perspective.

plate_z

  • [float64] Vertical position of the ball when it crosses home plate from the catcher’s perspective.

Downstream Pitcher Controlled

These variables are pitch characteristics, and may be somewhat controlled by the pitcher, but they are largely functions of the previous variables.

pfx_x

  • [float64] Horizontal movement in feet from the catcher’s perspective.

pfx_z

  • [float64] Vertical movement in feet from the catcher’s perspective.

Situational Information

These variables describe part of the game situation when the pitch was thrown. (We have omitted some other obvious variables here, like score and inning, just for simplicity.) These are fixed before a pitch is thrown, but could have an effect. Pitchers and batters often act differently based on the game situation. For example, batters are known to “protect” the plate when there are two strikes, and are thus much more likely to swing.

balls

  • [int64] Pre-pitch number of balls in count.

strikes

  • [int64] Pre-pitch number of strikes in count.

on_3b

  • [int64] Pre-pitch MLB Player Id of Runner on 3B.

on_2b

  • [int64] Pre-pitch MLB Player Id of Runner on 2B.

on_1b

  • [int64] Pre-pitch MLB Player Id of Runner on 1B.

outs_when_up

  • [int64] Pre-pitch number of outs.

Fixed Batter Information

These variables give some information about the batter facing the pitch. In particular, are they a righty or lefty, and the size of their strike zone, which is a function of their height.

stand

  • [object] Side of the plate the batter is standing on.

sz_top

  • [float64] Top of the batter’s strike zone set by the operator when the ball is halfway to the plate.

sz_bot

  • [float64] Bottom of the batter’s strike zone set by the operator when the ball is halfway to the plate.

Data in Python

To load the data in Python, use:

import pandas as pd
pitches_train = pd.read_csv("https://cs307.org/lab-08/data/pitches-train.csv")
pitches_test = pd.read_csv("https://cs307.org/lab-08/data/pitches-test.csv")

Prepare Data for Machine Learning

Because the data is already train-test split, we can simply create the X and y variants of the data.

# create X and y for train data
X_train = pitches_train.drop(columns=["swing"])
y_train = pitches_train["swing"]

# create X and y for test data
X_test = pitches_test.drop(columns=["swing"])
y_test = pitches_test["swing"]

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
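For a binary response, the overall swing rate and the swing rate by pitch type are natural first summaries. The sketch below shows the relevant pandas calls on a small hypothetical frame (the rows are made up for illustration); in the lab, run the same calls on pitches_train.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the lab data;
# in practice, run these calls on pitches_train loaded from the CSV.
pitches_train = pd.DataFrame({
    "pitch_name": ["4-Seam Fastball", "Knuckle Curve", "4-Seam Fastball", "Changeup"],
    "release_speed": [94.1, 82.3, 93.7, 86.0],
    "swing": [1, 0, 1, 0],
})

# Overall swing rate: the baseline probability any model should improve upon
swing_rate = pitches_train["swing"].mean()

# Swing rate and pitch count broken out by pitch type
by_pitch = pitches_train.groupby("pitch_name")["swing"].agg(["mean", "count"])

print(swing_rate)  # 0.5 on this toy sample
print(by_pitch)
```

The same pattern (groupby plus agg) extends to any of the situational variables, such as swing rate by strikes.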

Models

For this lab, you will need to train two separate but related models.

Probability Model: Train a supervised model (likely a classifier) to predict swing from the other variables. However, this model will not be evaluated on its ability to classify pitches as swing or no swing. Instead, we will directly assess its ability to estimate the probability of a swing. Thus, you need a well-calibrated model.

The sklearn user guide page on probability calibration will provide some hints. Importantly, you may need to use CalibratedClassifierCV to further calibrate the probability estimates from a classifier.

In the autograder, we will use two metrics to assess your submitted model:

  • Expected Calibration Error (ECE): This is essentially how far on average the points on a calibration plot are from the “perfect” line.
  • Maximum Calibration Error (MCE): This is essentially the furthest any point on a calibration plot is from the “perfect” line.

We provide Python functions for creating calibration plots and calculating these metrics.
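The course-provided functions are what the autograder uses; for intuition only, here is one plausible sketch of ECE and MCE built on sklearn's calibration_curve. Note that some ECE definitions weight each bin by its count; this unweighted version is an assumption, not necessarily the autograder's exact formula.

```python
import numpy as np
from sklearn.calibration import calibration_curve

def ece_mce(y_true, y_prob, n_bins=10):
    """Sketch of ECE/MCE: bin the predicted probabilities, then take the
    average (ECE) or maximum (MCE) gap between the mean predicted
    probability and the observed positive rate within each bin."""
    frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    gaps = np.abs(frac_pos - mean_pred)
    return gaps.mean(), gaps.max()

# Toy check: predictions that match the outcomes exactly give zero error
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.0, 0.0, 1.0, 1.0])
ece, mce = ece_mce(y_true, y_prob)
print(ece, mce)  # 0.0 0.0
```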

Novelty Detector: The second model you will train is an unsupervised novelty detector. It should be fit to the training features only. We will test how many observations it flags as novel (outliers) in the test data. Your detector should flag at least one observation as novel in each of the test and production data, but should flag no more than 5% of the observations. Use 1 for inliers and -1 for outliers, as is the default in sklearn.

For this lab, you may train models however you’d like! The only rules are:

Probability Model:

  • Models must start from the given training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with, and the same variable names and types as, X_train.
    • We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation.
      • So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
  • Your model must have a fit method.
  • Your model must have a predict method that returns numbers.
  • Your model must have a predict_proba method that returns numbers.
  • Your model should be created with scikit-learn version 1.4.0 or newer.
  • Your model should be serialized with joblib version 1.3.2 or newer.
    • Your serialized model must be less than 5MB.

Novelty Detector:

  • Models must start from the given training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with, and the same variable names and types as, X_train.
  • Your model should be created with scikit-learn version 1.4.0 or newer.
  • Your model should be serialized with joblib version 1.3.2 or newer.
    • Your serialized model must be less than 5MB.
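As one sketch of a novelty detector satisfying these rules, the example below fits an IsolationForest inside a Pipeline on hypothetical training features. The column names and values are placeholders, and IsolationForest is just one of several detectors available in sklearn; note that it is fit without a response.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-ins for the training and test features
rng = np.random.default_rng(307)
X_train = pd.DataFrame({
    "release_speed": rng.normal(93, 1.5, 300),
    "pitch_name": rng.choice(["4-Seam Fastball", "Changeup"], 300),
})
X_test = pd.DataFrame({
    "release_speed": np.append(rng.normal(93, 1.5, 99), 60.0),  # one clear outlier
    "pitch_name": rng.choice(["4-Seam Fastball", "Changeup"], 100),
})

preprocess = ColumnTransformer([
    ("num", SimpleImputer(), ["release_speed"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["pitch_name"]),
])

# contamination controls roughly what fraction of points are flagged;
# tune it so the test flag rate lands between "at least one" and 5%
detector = Pipeline([
    ("preprocess", preprocess),
    ("detect", IsolationForest(contamination=0.01, random_state=1)),
])
detector.fit(X_train)  # unsupervised: features only, no y

flags = detector.predict(X_test)  # 1 = inlier, -1 = novel
prop_novel = (flags == -1).mean()
print(prop_novel)
```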

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.

To obtain the maximum points via the autograder, your model performance must meet or exceed:

Test ECE: 0.075
Test MCE: 0.135
Test Proportion Novel: 0.05
Production ECE: 0.075
Production MCE: 0.135
Production Proportion Novel: 0.05

The production data mimics data that would be passed through your model after you have put it into production, that is, while it is being used for the stated goal within the scenario of the lab. As such, you do not have access to it. You do, however, have access to the test data.

Model Persistence

To save your models for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects.

Because the models for this lab could be quite large, consider using the compress parameter to the dump function!
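For example, saving and reloading with compression might look like the following, where DummyClassifier stands in for your actual fitted model and the file path is illustrative (check PrairieLearn for the expected filename).

```python
import os
import tempfile

import joblib
from sklearn.dummy import DummyClassifier

# Any fitted estimator works here; DummyClassifier is just a small placeholder.
model = DummyClassifier(strategy="most_frequent").fit([[0], [1], [2]], [0, 0, 1])

# compress=3 trades a little save time for a much smaller file,
# which helps stay under the 5MB autograder limit.
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path, compress=3)

# Loading the file back recovers a working estimator
reloaded = joblib.load(path)
print(reloaded.predict([[5]]))  # [0]
```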

Discussion

As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab! If you are asked to train multiple models, first make clear which model you selected and are considering for use in practice. Discuss any limitations or potential improvements.

Additional discussion topics:

  • If you tried to use this model as an MLB coach, which variables would you ask the pitcher to modify to induce a swing? Why?

When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!

Template Notebook

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy page!

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.