What Causes Overfitting?
In supervised learning, we want a model that generalizes well.
Generalization in supervised learning refers to a model’s ability to adapt to new, previously unseen data. That is, a model fit to training data that can make good predictions for data not part of the training data!
So far, the validation and test sets have given us some insight into model performance from this perspective.
# Simulate three datasets from the same process: y = sin(x) + noise.
import numpy as np

n = 100  # number of observations per dataset (assumed value; set n before running)

np.random.seed(42)
X_01 = np.random.uniform(low=-2*np.pi, high=2*np.pi, size=(n, 1))
y_01 = np.sin(X_01).ravel() + np.random.normal(loc=0, scale=0.25, size=n)

np.random.seed(307)
X_02 = np.random.uniform(low=-2*np.pi, high=2*np.pi, size=(n, 1))
y_02 = np.sin(X_02).ravel() + np.random.normal(loc=0, scale=0.25, size=n)

np.random.seed(1337)
X_03 = np.random.uniform(low=-2*np.pi, high=2*np.pi, size=(n, 1))
y_03 = np.sin(X_03).ravel() + np.random.normal(loc=0, scale=0.25, size=n)
Which of these is most wiggly?
A model’s flexibility is its ability to capture patterns in (training) data. The more flexible a model, the more closely it will fit training data.
In regression, a good mental model for flexibility is wiggliness.
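As an illustrative sketch (the polynomial degrees and the use of matplotlib are choices of convenience, reusing X_01 and y_01 from the simulation above), fitting models of very different flexibility to the same data makes wiggliness visible:

# Sketch: an inflexible (degree-1) and a very flexible (degree-15) polynomial
# fit to the first simulated dataset; the flexible fit is much more wiggly.
import matplotlib.pyplot as plt
from numpy.polynomial import Polynomial

x_grid = np.linspace(-2 * np.pi, 2 * np.pi, 500)
plt.scatter(X_01, y_01, s=10, color="gray", alpha=0.5)
for deg in (1, 15):
    fit = Polynomial.fit(X_01.ravel(), y_01, deg=deg)
    plt.plot(x_grid, fit(x_grid), label=f"degree {deg}")
plt.legend()
plt.show()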
In the previous example, the most flexible (most wiggly) fits begin to chase the noise rather than the underlying signal.
Overfitting is a situation where a model learns the noise in the training data.
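A quick hedged check of this idea using the simulated datasets above (the choice of a 1-nearest-neighbor model is illustrative): a very flexible model can reproduce the training data almost perfectly, yet do noticeably worse on a fresh draw from the same process.

# Sketch: a 1-NN regressor memorizes the training data (dataset 1), but the
# memorized noise does not help on dataset 2.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

knn = KNeighborsRegressor(n_neighbors=1).fit(X_01, y_01)
rmse_train = np.sqrt(mean_squared_error(y_01, knn.predict(X_01)))
rmse_new = np.sqrt(mean_squared_error(y_02, knn.predict(X_02)))
print("RMSE on training data:", round(rmse_train, 3))
print("RMSE on new data:     ", round(rmse_new, 3))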
What causes overfitting?
In the real world, flexibility is often a good thing. In machine learning, too much flexibility leads to overfitting.
Flexibility is often relative to the amount of data available. With a lot of data, you can use a more flexible model.
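A hedged sketch of this point (the degree-15 polynomial and the sample sizes are arbitrary choices, not part of the example above): the same flexible model that fails badly on a small dataset can work well when given much more data.

# Sketch: fit the same flexible model (degree-15 polynomial) to a small and a
# large training set drawn from the sine process, then compare test error.
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def simulate(n_obs):
    x = rng.uniform(-2 * np.pi, 2 * np.pi, size=n_obs)
    return x, np.sin(x) + rng.normal(0, 0.25, size=n_obs)

x_eval, y_eval = simulate(10_000)
for n_train in (20, 2000):
    x_tr, y_tr = simulate(n_train)
    fit = Polynomial.fit(x_tr, y_tr, deg=15)
    test_mse = np.mean((fit(x_eval) - y_eval) ** 2)
    print(f"n = {n_train:4d}, test MSE = {test_mse:.3f}")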
How do we determine the correct amount of flexibility?
Many machine learning models have tuning parameters that either directly control flexibility or implicitly affect it.
With KNN, \(k\) is a tuning parameter that controls the model’s flexibility. Recall our example:
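The original example is not reproduced here, but a hedged stand-in (reusing X_01 and y_01 from above) illustrates the same point: as \(k\) shrinks, the model becomes more flexible, which shows up as a lower training error.

# Sketch: smaller k = more flexible KNN = lower error on the training data.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

for k in (50, 10, 1):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_01, y_01)
    train_rmse = np.sqrt(mean_squared_error(y_01, knn.predict(X_01)))
    print(f"k = {k:3d}, train RMSE = {train_rmse:.3f}")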
Tuning parameters also define how a model learns from data. This is in contrast to model parameters, like the \(\beta\) coefficients in linear regression, which are learned from the data.
But we still want to find good values for the tuning parameters, so that the model has an appropriate amount of flexibility.
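To make the distinction concrete, a sketch using scikit-learn (the specific models are illustrative):

# Sketch: k is set by us before fitting; the betas of linear regression are
# estimated from the data during fitting.
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

lm = LinearRegression().fit(X_01, y_01)
print("learned model parameters (betas):", lm.intercept_, lm.coef_)

knn = KNeighborsRegressor(n_neighbors=5)  # tuning parameter, chosen by the user
knn.fit(X_01, y_01)                       # fitting stores the data; no betas are estimated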
How?
The process of finding “good” values for tuning parameters is known as hyperparameter tuning.
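A minimal sketch of this process with a single validation split (the candidate values of \(k\) and the split proportion are assumptions, again reusing X_01 and y_01):

# Sketch: hold out a validation set, fit one model per candidate k, and keep
# the k with the lowest validation error.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X_01, y_01, test_size=0.3, random_state=1)

val_rmse = {}
for k in (1, 5, 10, 25, 50):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    val_rmse[k] = np.sqrt(mean_squared_error(y_val, knn.predict(X_val)))

best_k = min(val_rmse, key=val_rmse.get)
print("validation RMSE by k:", {k: round(v, 3) for k, v in val_rmse.items()})
print("selected k:", best_k)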
This has a problem: if we select tuning parameters using a single validation split, the chosen values can vary a lot from split to split.
Some simulations:
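A hedged sketch of the kind of simulation meant here (not the original code; candidate values, number of repetitions, and split sizes are all assumptions): repeat the tuning process across many random splits, once with a single validation split and once with 5-fold cross-validation, and compare how much the selected \(k\) varies.

# Sketch: compare the stability of the selected k across repeated random
# splits, with and without cross-validation.
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

ks = [1, 5, 10, 25, 50]
single_split_choices, cv_choices = [], []

for rep in range(50):
    # tuning with a single validation split
    X_tr, X_val, y_tr, y_val = train_test_split(X_01, y_01, test_size=0.3, random_state=rep)
    val_rmse = [
        np.sqrt(mean_squared_error(
            y_val, KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr).predict(X_val)))
        for k in ks
    ]
    single_split_choices.append(ks[int(np.argmin(val_rmse))])

    # tuning with 5-fold cross-validation
    grid = GridSearchCV(
        KNeighborsRegressor(),
        param_grid={"n_neighbors": ks},
        cv=KFold(n_splits=5, shuffle=True, random_state=rep),
        scoring="neg_root_mean_squared_error",
    )
    grid.fit(X_01, y_01)
    cv_choices.append(grid.best_params_["n_neighbors"])

print("single split, selected k values:", np.unique(single_split_choices, return_counts=True))
print("5-fold CV,    selected k values:", np.unique(cv_choices, return_counts=True))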
Takeaway: By using cross-validation, the hyperparameter tuning process is much more stable. In particular, there is less variance in the selected values of the tuning parameters.
What makes all this work? Why does flexibility lead to overfitting?
The bias-variance tradeoff is a fundamental machine learning concept that describes the balance between (statistical) bias and variance of models.
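For squared-error loss, this balance can be made precise: the expected prediction error at a point \(x\) decomposes as

\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}},
\]

where \(f\) is the true regression function and \(\sigma^2\) is the noise variance. Flexible models tend to have low bias but high variance; inflexible models tend to have high bias but low variance.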
We’d like a model with no bias and no variance, but that is not possible; there is a tradeoff!
Hyperparameter tuning helps us find the right balance.
Why do we need both a validation and test dataset?
In general: the validation data is used to choose among models (for example, to tune \(k\)), while the test data is used only once, at the very end, to get an honest estimate of how the chosen model will perform on new data.
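A hedged sketch of that workflow, reusing the first simulated dataset (the split proportions and candidate values of \(k\) are arbitrary):

# Sketch: train / validation / test. Tune k on the validation data, then
# report the chosen model's performance on the untouched test data.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

X_tmp, X_test, y_tmp, y_test = train_test_split(X_01, y_01, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

best_k, best_rmse = None, np.inf
for k in (1, 5, 10, 25):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, knn.predict(X_val)))
    if rmse < best_rmse:
        best_k, best_rmse = k, rmse

# refit the chosen model on train + validation, then evaluate once on the test data
final_model = KNeighborsRegressor(n_neighbors=best_k).fit(X_tmp, y_tmp)
test_rmse = np.sqrt(mean_squared_error(y_test, final_model.predict(X_test)))
print("chosen k:", best_k, "| estimated test RMSE:", round(test_rmse, 3))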