Model Flexibility and Generalization

What Causes Overfitting?

David Dalpiaz

Generalization

The Goal of Supervised Learning

In supervised learning, we want a model that generalizes well.

Generalization in supervised learning refers to a model’s ability to adapt to new, previously unseen data. That is, a model fit to training data should be able to make good predictions for data that was not part of the training data!

So far, the validation and test sets have given us some insight into model performance from this perspective.

Simulated Data

# simulate y = sin(x) + noise, with x uniform on (-2*pi, 2*pi)
import numpy as np
np.random.seed(42)
n = 100
X = np.random.uniform(low=-2*np.pi, high=2*np.pi, size=(n, 1))
y = np.sin(X).ravel() + np.random.normal(loc=0, scale=0.25, size=n)

Three Simulated Datasets

np.random.seed(42)
X_01 = np.random.uniform(low=-2*np.pi, high=2*np.pi, size=(n, 1))
y_01 = np.sin(X_01).ravel() + np.random.normal(loc=0, scale=0.25, size=n)

np.random.seed(307)
X_02 = np.random.uniform(low=-2*np.pi, high=2*np.pi, size=(n, 1))
y_02 = np.sin(X_02).ravel() + np.random.normal(loc=0, scale=0.25, size=n)

np.random.seed(1337)
X_03 = np.random.uniform(low=-2*np.pi, high=2*np.pi, size=(n, 1))
y_03 = np.sin(X_03).ravel() + np.random.normal(loc=0, scale=0.25, size=n)

Model Flexibility

Which of these is most wiggly?

Model Flexibility

A model’s flexibility is its ability to capture patterns in (training) data. The more flexible a model, the more closely it will fit training data.

In regression, a good mental model for flexibility is wiggliness.

In the previous example:

  • \(k = 1\): Very flexible!
  • \(k = 10\): Reasonably flexible!
  • \(k = 100\): Rather inflexible!
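
As a quick sketch of what this looks like (illustrative code, assuming KNN regression via scikit-learn rather than the exact code behind the plots), the training error shrinks as \(k\) decreases and the model becomes more flexible:

# illustrative sketch: fit KNN with k = 1, 10, 100 and compare training error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

for k in [1, 10, 100]:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    train_rmse = mean_squared_error(y, knn.predict(X)) ** 0.5
    print(f"k = {k:3d}, train RMSE = {train_rmse:.3f}")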

Overfitting

Overfitting is a situation where a model learns the noise in the training data.

  • In other words: The model has learned the specific training data, rather than a general pattern.
  • A model that is overfit is too specific to the training data.
  • A model that is overfit will perform poorly on new data.
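
One way to see this (again an illustrative sketch, using KNN from scikit-learn on the simulated data) is to hold out part of the data: the most flexible model has near-zero training error, but typically does worse on the held-out data than a moderately flexible one.

# sketch: compare training and held-out error for a few values of k
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

for k in [1, 10, 50]:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    train_rmse = mean_squared_error(y_train, knn.predict(X_train)) ** 0.5
    test_rmse = mean_squared_error(y_test, knn.predict(X_test)) ** 0.5
    print(f"k = {k:2d}, train RMSE = {train_rmse:.3f}, test RMSE = {test_rmse:.3f}")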

What causes overfitting?

Too Flexible

In the real world, we often like flexibility. In machine learning, too much flexibility is bad.

  • Overly flexible models lead to overfitting. The more flexible a model, the better it can learn the training data.
  • If a model is too inflexible, it will underfit the data and fail to extract enough signal.

Flexibility is often relative to the amount of data available. With a lot of data, you can use a more flexible model.

How do we determine the correct amount of flexibility?

Tuning Parameters

Many machine learning models have tuning parameters that either directly control flexibility, or implicitly impact flexibility.

With KNN, the \(k\) is a tuning parameter that controls the model flexibility. Recall our example:

  • \(k = 1\): Very flexible!
  • \(k = 10\): Reasonably flexible!
  • \(k = 100\): Rather inflexible!

Tuning Parameters

Tuning parameters also define how a model learns from data. This is in contrast to model parameters, like the \(\beta\) in linear regression, which are learned from data.
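
A small illustration of the distinction (a sketch, not slide code), using the simulated data from earlier:

# n_neighbors is a tuning parameter: we choose it before fitting
# the coefficients of a linear regression are model parameters: learned from data
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

knn = KNeighborsRegressor(n_neighbors=10)  # tuning parameter chosen by us
lm = LinearRegression().fit(X, y)          # beta estimated from the training data
print(lm.intercept_, lm.coef_)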

But we still want to learn good values for the tuning parameters, so that the model has an appropriate amount of flexibility.

How?

Hyperparameter Tuning

The process of finding “good” values for tuning parameters is known as hyperparameter tuning.

  • It is a somewhat meta learning process.
    • Specific models are directly “learned” from data.
      • We learn to make predictions using KNN with \(k = 1\).
    • Tuning parameters are “learned” by checking how specific models behave.
      • We fit many KNN models with different \(k\) and then pick one value for \(k\).

(Current) Hyperparameter Tuning Process

  • Train-Test split the data
  • Split the Train data into Validation-Train and Validation
  • Fit several candidate models to Validation-Train
    • For example, KNN with many values of \(k\)
  • For each model fit, calculate a Validation metric
  • Select the value of \(k\) that gives the best Validation metric
    • Fit this model to the full train data. Calculate metrics on test data for a final quantification of performance.
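
A minimal sketch of this process (assuming KNN, RMSE as the validation metric, and the simulated data from earlier; the grid of candidate \(k\) values is arbitrary):

# sketch of the single-validation-split tuning process described above
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# train-test split, then split train into validation-train and validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_vtrain, X_val, y_vtrain, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# fit candidate models to validation-train, score each on validation
k_values = [1, 5, 10, 25, 50]
val_rmse = []
for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_vtrain, y_vtrain)
    val_rmse.append(mean_squared_error(y_val, knn.predict(X_val)) ** 0.5)

# select the k with the best validation metric, refit to the full train data,
# and quantify performance on the test data
best_k = k_values[int(np.argmin(val_rmse))]
final = KNeighborsRegressor(n_neighbors=best_k).fit(X_train, y_train)
test_rmse = mean_squared_error(y_test, final.predict(X_test)) ** 0.5
print(f"selected k = {best_k}, test RMSE = {test_rmse:.3f}")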

This has a problem…

Validation Curve
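
A validation curve plots a validation metric against values of a tuning parameter. Here is a sketch of computing one with scikit-learn's validation_curve helper, using a single validation split to match the process above (the grid of \(k\) values is arbitrary):

# sketch: training and validation RMSE as a function of k, one validation split
import numpy as np
from sklearn.model_selection import ShuffleSplit, validation_curve
from sklearn.neighbors import KNeighborsRegressor

k_values = [1, 5, 10, 25, 50]
train_scores, val_scores = validation_curve(
    KNeighborsRegressor(), X, y,
    param_name="n_neighbors", param_range=k_values,
    cv=ShuffleSplit(n_splits=1, test_size=0.25, random_state=42),
    scoring="neg_root_mean_squared_error",
)

# negate the scores to recover RMSE (lower is better)
for k, tr, va in zip(k_values, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"k = {k:2d}, train RMSE = {tr:.3f}, validation RMSE = {va:.3f}")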

Hyperparameter Tuning with Code

Some simulations:
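
The original simulation code is not reproduced here; the sketch below shows the kind of comparison involved, assuming KNN: repeatedly simulate data, select \(k\) with a single validation split and with 5-fold cross-validation, and compare how much the selected values vary.

# sketch of a stability simulation: selected k with one split vs. 5-fold CV
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

k_grid = [1, 5, 10, 25, 50]
selected_split, selected_cv = [], []

for sim in range(20):
    rng = np.random.default_rng(sim)
    X_sim = rng.uniform(-2 * np.pi, 2 * np.pi, size=(100, 1))
    y_sim = np.sin(X_sim).ravel() + rng.normal(0, 0.25, size=100)

    # single validation split, scored by R^2 on the validation set
    X_vtrain, X_val, y_vtrain, y_val = train_test_split(X_sim, y_sim, test_size=0.25, random_state=sim)
    val_scores = [KNeighborsRegressor(n_neighbors=k).fit(X_vtrain, y_vtrain).score(X_val, y_val) for k in k_grid]
    selected_split.append(k_grid[int(np.argmax(val_scores))])

    # 5-fold cross-validation on the same data
    cv_scores = [cross_val_score(KNeighborsRegressor(n_neighbors=k), X_sim, y_sim, cv=5).mean() for k in k_grid]
    selected_cv.append(k_grid[int(np.argmax(cv_scores))])

print("selected k, single split:", selected_split)
print("selected k, 5-fold CV:   ", selected_cv)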

Takeaway: By using cross-validation, the hyperparameter tuning process is much more stable. In particular, there is less variance in the selected values of the tuning parameters.

Cross-Validation

Cross-Validation Pipeline
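
A sketch of what such a pipeline might look like in scikit-learn (the scaling step and the grid of \(k\) values are illustrative choices, not taken from the slides): preprocessing and the model are wrapped in a Pipeline so that preprocessing is re-fit within each fold, and GridSearchCV handles the cross-validation and the final refit on the full training data.

# sketch: cross-validated tuning of k inside a Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

grid = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [1, 5, 10, 25, 50]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_train, y_train)

print("selected k:", grid.best_params_["knn__n_neighbors"])
print("test RMSE: ", -grid.score(X_test, y_test))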

Bias-Variance Tradeoff

What makes all this work? Why does too much flexibility lead to overfitting?

The bias-variance tradeoff is a fundamental machine learning concept that describes the balance between (statistical) bias and variance of models.

  • Flexible models are low bias, but highly variable.
  • Inflexible models are low variance, but highly biased.

We’d like a model with no bias and no variance, but that is not possible: there is a tradeoff!

Hyperparameter tuning helps us find the right balance.
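
For squared-error loss this can be made precise. Assuming the usual setup \(y = f(x) + \varepsilon\) with \(\mathbb{E}[\varepsilon] = 0\) and \(\text{Var}(\varepsilon) = \sigma^2\), the expected prediction error of a learned \(\hat{f}\) at a point \(x\) decomposes as

\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\text{Var}\big(\hat{f}(x)\big)}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}
\]

Making a model more flexible tends to shrink the first term and grow the second; the last term cannot be reduced by any model.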

Bias-Variance Tradeoff

Validation and Test Data

Why do we need both a validation and test dataset?

  1. A separate test set provides a gut-check to make sure you cross-validated correctly.
  2. It avoids the subtle optimistic bias introduced by model selection!
    • By using the validation sets to choose among models, you have still “seen” that data, so validation metrics overestimate how well the selected model will generalize.

In general:

  • Test: performance quantification
  • (Cross) Validation: hyperparameter tuning