k-Nearest Neighbors and Data Splitting

A First Method for Regression

David Dalpiaz

The Regression Setup

ML Tasks

  • Supervised Learning
    • Classification
    • Regression
  • Unsupervised Learning
    • Clustering
    • Novelty and Outlier Detection
  • Reinforcement Learning

Simulated Data

import numpy as np
np.random.seed(42)
n = 200
X = np.random.uniform(low=-2*np.pi, high=2*np.pi, size=(n, 1))
y = np.sin(X).ravel() + np.random.normal(loc=0, scale=0.25, size=n)
X.shape, y.shape # verify shapes of the data
((200, 1), (200,))

In general, when working with ML data, especially with sklearn:

  • X will contain the features
  • y will contain the target

Notice that X here is a two-dimensional array, while y is one-dimensional. This will generally be the setup when working in numpy.
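For example, a one-dimensional array of feature values can be reshaped into the two-dimensional form that sklearn expects. A minimal sketch (the variable names here are just for illustration):

import numpy as np

x_flat = np.array([1.0, 2.0, 3.0])  # one-dimensional, shape (3,)
X_2d = x_flat.reshape(-1, 1)        # two-dimensional, shape (3, 1), one column of features
x_flat.shape, X_2d.shape
((3,), (3, 1))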

Simulated Data: Tabular View

x y
0 -1.576575 -1.169989
1 5.663843 -0.522436
2 2.915322 0.297613
3 1.239779 0.767124
4 -4.322597 1.391432
... ... ...
195 -1.894888 -1.299806
196 2.839443 0.117962
197 4.990235 -1.015010
198 4.864271 -0.910761
199 3.517020 0.002169

200 rows × 2 columns

Simulated Data: Graphical View

The Goal of Regression

We want to learn the conditional distribution of \(Y\) given \(\boldsymbol{X}\).

\[ Y \mid \boldsymbol{X} = \boldsymbol{x} \]

Often, the best we can do is learn the conditional mean of \(Y\) given \(\boldsymbol{X}\).

\[ \mathbb{E}[Y \mid \boldsymbol{X} = \boldsymbol{x}] \]

This is more theory than we really need in order to use machine learning.

The Goal of Regression in ML

Given a new \(X\), predict \(Y\). Remember: You only have access to the image on the right.

  • When \(x = -\pi\), what should we predict for \(y\)?
  • When \(x = 0\), what should we predict for \(y\)?
  • When \(x = \pi\), what should we predict for \(y\)?

Introducing KNN

Big Idea: Use “Nearby” Points

To predict \(y\) at \(x = 0\), find the “nearby” samples in the data. (With respect to \(x\).) Predict the average of the \(y\) values of these samples.

How?

Two questions:

  1. What defines near?
  2. How many samples should we use?

Answers:

  1. Distance, usually the Euclidean distance.
  2. Good question!

k-Nearest Neighbors

For some \(k\), which determines the number of neighbors to use, predict \(y\) using

\[ \hat{\mu}(x) = \frac{1}{k} \sum_{ \{i \ : \ x_i \in \mathcal{N}_k(x, \mathcal{D}) \} } y_i. \]

Here, we have:

  • \(\mathcal{D} = \{ (x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}, \ i = 1, 2, \ldots, n \}\), the data used during fitting.
  • \(\{i \ : \ x_i \in \mathcal{N}_k(x, \mathcal{D}) \}\), the indices of the observations whose \(x_i\) values are among the \(k\) nearest neighbors of the point \(x\).
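To make the definition concrete, here is a rough from-scratch sketch of a KNN prediction at a single point. This is not how sklearn implements the method; knn_predict is a hypothetical helper used only to illustrate the formula.

import numpy as np

def knn_predict(x, X, y, k):
    # Euclidean distances from the query point x to every training point
    dists = np.sqrt(np.sum((X - x) ** 2, axis=1))
    # indices of the k nearest neighbors, that is, the set N_k(x, D)
    nearest = np.argsort(dists)[:k]
    # average the target values of those neighbors
    return np.mean(y[nearest])

For example, knn_predict(np.array([0.0]), X, y, k=10) would average the y values of the 10 training points whose x values are closest to zero.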

KNN Fit to the Simulated Sine Data

Quantifying Performance

Use RMSE or another regression metric!

Root Mean Squared Error (\(\text{RMSE}\), rmse)

\[ \text{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]

Let’s fit with a couple more values of \(k\) and see which is best.

KNN with k = (1, 10, 100)
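
The code used for this comparison is not shown here. A minimal sketch that fits each model to the full simulated data and predicts on that same data (which is why \(k = 1\) achieves an RMSE of exactly zero) might look like the following; it assumes X, y, and np from the earlier setup.

from sklearn.neighbors import KNeighborsRegressor

for k in (1, 10, 100):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X, y)  # fit to the full data
    pred = knn.predict(X)                                # predict on the same data
    rmse_k = np.sqrt(np.mean((y - pred) ** 2))           # RMSE as defined above
    print(f"RMSE for k = {k}:", rmse_k)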

RMSE for k = 1: 0.0
RMSE for k = 10: 0.23366221486551458
RMSE for k = 100: 0.764849529471475

Generalization

A model generalizes well if it is able to make good predictions on new data. Predicting on already observed data is easy!

Signal and Noise

\[ \Huge Y = f(X) + \epsilon \]

  • We want to learn the signal, \(f\).
  • We do not want to learn the noise.

When we overfit, which is one way to generalize poorly, we have accidentally learned the noise.

Data Splitting

Train-Test Split

In practice, we cannot simply simulate more data!

However, the “easy” solution here is to randomly split the available data: use some of it for training, that is, the data used to fit a model, and reserve the rest for testing, that is, the data on which we will calculate metrics.

But, we actually need to go one step further…

Train-Validation-Test Split

  • Train: The data used to train and select models.
    • Validation-Train: The data used to fit models during training.
    • Validation: The data used to evaluate models during training.
  • Test: The data used only for a final evaluation of an already chosen model.

Train-Validation-Test Split Flowchart

flowchart LR
  A("Full Data") 
  A -->|"80%"| B("Train Data")
  B -->|"80%"| D("Validation-Train Data")
  B -->|"20%"| E("Validation Data")
  A -->|"20%"| C("Test Data")

An 80-20 split is common, but not required. The choice of how much data to put into each set is called data budgeting.
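For example, with the \(n = 200\) simulated observations used throughout these slides, an 80-20 budget gives \(0.80 \times 200 = 160\) train and \(40\) test observations; splitting the train data 80-20 again gives \(128\) validation-train and \(32\) validation observations. These are exactly the shapes verified in the example later.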

Validation and Test Metrics

Now that we want to fit a model to some data but calculate metrics on other data, we need to update our metric definitions.

Root Mean Squared Error (\(\text{RMSE}\), rmse)

\[ \text{RMSE}(f, \mathcal{D}) = \sqrt{ \frac{1}{n_\mathcal{D}} \sum_{i \in \mathcal{D}} \left( y_i - f(x_i) \right) ^ 2} \]

Importantly, we now need to consider both a dataset and a function (learned from data) when calculating metrics. We’re still comparing “true” values to “predicted” values, but we need to pay attention to where they come from.

Here:

  • \(f\) is a function that outputs predictions. In sklearn, a function like some_model.predict().
  • \(\mathcal{D}\) is a dataset, usually either the validation or test data.
  • \(n_\mathcal{D}\) is the number of observations in the dataset \(\mathcal{D}\).
  • \(i\) is the index of an observation (row) of the dataset \(\mathcal{D}\). \(x_i\) is the feature value(s) for this observation and \(y_i\) is the target value.
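
As a small code sketch of this idea (the helper name rmse_for is just for illustration and is not part of sklearn):

import numpy as np

def rmse_for(model, X_data, y_data):
    # model.predict plays the role of f; (X_data, y_data) plays the role of D
    predictions = model.predict(X_data)
    return np.sqrt(np.mean((y_data - predictions) ** 2))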

Validation and Test Metrics

Validation metrics…

  • Use models fit to validation-train data.
  • Use validation data.

Test metrics…

  • Use models fit to train data.
  • Use test data.

Example

Data Setup

We’ll return to our simulated data from earlier.

import numpy as np
np.random.seed(42)
n = 200
X = np.random.uniform(low=-2*np.pi, high=2*np.pi, size=(n,1))
y = np.sin(X).ravel() + np.random.normal(loc=0, scale=0.25, size=n)
X.shape, y.shape # verify shapes of the data
((200, 1), (200,))

We’ll also load some additional libraries we’ll need.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

Data Splitting

We first split the full data into a train and test set.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
X_train.shape, X_test.shape # verify shapes of the data
((160, 1), (40, 1))

We then repeat the process to split the train set into a validation-train and validation set.

X_vtrain, X_validation, y_vtrain, y_validation = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42
)
X_vtrain.shape, X_validation.shape # verify shapes of the data
((128, 1), (32, 1))

Create Models

knn001 = KNeighborsRegressor(n_neighbors=1)
knn010 = KNeighborsRegressor(n_neighbors=10)
knn100 = KNeighborsRegressor(n_neighbors=100)

Fit Models using .fit()

knn001.fit(X_vtrain, y_vtrain)
KNeighborsRegressor(n_neighbors=1)
knn010.fit(X_vtrain, y_vtrain)
KNeighborsRegressor(n_neighbors=10)
knn100.fit(X_vtrain, y_vtrain)
KNeighborsRegressor(n_neighbors=100)

Predictions using .predict()

knn010.predict(X_validation)
array([-0.56366024,  0.31900317, -0.67916829, -0.92808265, -0.86472793,
        0.17168979, -0.67916829, -0.08602215, -0.93826351, -0.55504094,
        0.17168979, -0.46108356, -0.55645133,  0.44828237, -0.46108356,
       -0.36936917,  0.22611211, -0.46108356, -0.90564424, -0.56366024,
        0.94364577,  0.09707822,  0.91014566,  0.22611211, -0.90590527,
       -0.36936917, -0.08602215,  0.22611211, -0.89805858, -0.90590527,
        0.09707822,  0.25873394])
knn010.predict(X_validation).shape
(32,)

Validation Metrics

def rmse(y_true, y_pred):
    # root mean squared error between true values and predictions
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

rmse_val_001 = rmse(y_validation, knn001.predict(X_validation))
rmse_val_010 = rmse(y_validation, knn010.predict(X_validation))
rmse_val_100 = rmse(y_validation, knn100.predict(X_validation))

print("Validation RMSE for k = 1:", rmse_val_001)
print("Validation RMSE for k = 10:", rmse_val_010)
print("Validation RMSE for k = 100:", rmse_val_100)
Validation RMSE for k = 1: 0.3279636107055588
Validation RMSE for k = 10: 0.24230224459351157
Validation RMSE for k = 100: 0.682208033130319

Based on these results, we would select a \(k\) value of 10. This will be our chosen model.

Refit and Calculate Test Metric

# refit to train data
knn010.fit(X_train, y_train)

# calculate test RMSE
rmse_test_010 = rmse(y_test, knn010.predict(X_test))

# print
print("Test RMSE for k = 10:", rmse_test_010)
Test RMSE for k = 10: 0.2846290999172462

Visualizing Test Results

Recapping the Process

The general procedure for supervised learning is:

  1. Train-test split the full data.
  2. Further split the train data into validation-train and validation datasets.
  3. Fit all candidate models (in this case the three KNN models) to the validation-train dataset.
  4. Calculate validation RMSE. With the validation data, calculate the RMSE for predictions from these models.
  5. Select the model with the lowest validation RMSE.
  6. Fit the selected model to the train dataset.
  7. Calculate test RMSE. With the test data, calculate the RMSE for predictions from the model fit to the train data.

Summarizing Validation versus Test

In general,

  • Validation RMSE is for selecting models, often via their tuning parameters.
  • Test RMSE is for reporting the performance of a selected model.

Next week we’ll add one more layer of complexity to this procedure that we will use the rest of the semester: cross-validation.

That’s All Folks!