A First Method for Regression
import numpy as np
np.random.seed(42)
n = 200
X = np.random.uniform(low=-2*np.pi, high=2*np.pi, size=(n, 1))
y = np.sin(X).ravel() + np.random.normal(loc=0, scale=0.25, size=n)
X.shape, y.shape # verify shapes of the data
((200, 1), (200,))
In general, when working with ML data, especially with sklearn:
- X will contain the features
- y will contain the target

Notice that X here is a two-dimensional array, while y is one-dimensional. This will generally be the setup when working in numpy.
| | x | y |
|---|---|---|
| 0 | -1.576575 | -1.169989 |
| 1 | 5.663843 | -0.522436 |
| 2 | 2.915322 | 0.297613 |
| 3 | 1.239779 | 0.767124 |
| 4 | -4.322597 | 1.391432 |
| ... | ... | ... |
| 195 | -1.894888 | -1.299806 |
| 196 | 2.839443 | 0.117962 |
| 197 | 4.990235 | -1.015010 |
| 198 | 4.864271 | -0.910761 |
| 199 | 3.517020 | 0.002169 |
200 rows × 2 columns
We want to learn the conditional distribution of \(Y\) given \(\boldsymbol{X}\).
\[ Y \mid \boldsymbol{X} = \boldsymbol{x} \]
Often, the best we can do is learn the conditional mean of \(Y\) given \(\boldsymbol{X}\).
\[ \mathbb{E}[Y \mid \boldsymbol{X} = \boldsymbol{x}] \]
This is more theoretical than we really need to use machine learning.
Given a new \(X\), predict \(Y\). Remember: You only have access to the image on the right.
To predict \(y\) at \(x = 0\), find the “nearby” samples in the data. (With respect to \(x\).) Predict the average of the \(y\) values of these samples.
Two questions:
Answers:
For some \(k\), which determines the number of neighbors to use, predict \(y\) using
\[ \hat{\mu}(x) = \frac{1}{k} \sum_{ \{i \ : \ x_i \in \mathcal{N}_k(x, \mathcal{D}) \} } y_i. \]
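Here, \(\mathcal{N}_k(x, \mathcal{D})\) is the set of the \(k\) samples in the dataset \(\mathcal{D}\) that are nearest to \(x\), and \(\hat{\mu}(x)\) is our estimate of the conditional mean of \(Y\) at \(x\).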
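To make the averaging step concrete, here is a minimal from-scratch sketch of the nearest-neighbor prediction at a single point. The helper name knn_predict is hypothetical (it is not part of sklearn), and it assumes a single feature so that "nearby" means a small absolute difference in \(x\).

```python
import numpy as np

# simulated data, matching the setup at the top of these notes
np.random.seed(42)
n = 200
X = np.random.uniform(low=-2 * np.pi, high=2 * np.pi, size=(n, 1))
y = np.sin(X).ravel() + np.random.normal(loc=0, scale=0.25, size=n)


def knn_predict(x_new, X, y, k):
    # hypothetical helper: distance from x_new to every sample
    # (one feature, so distance is just the absolute difference)
    distances = np.abs(X.ravel() - x_new)
    # indices of the k samples nearest to x_new
    neighbors = np.argsort(distances)[:k]
    # predict the average of the neighbors' y values
    return np.mean(y[neighbors])


# predict y at x = 0 using the 10 nearest samples
mu_hat = knn_predict(0.0, X, y, k=10)
print(mu_hat)
```

Since the data were simulated around \(\sin(x)\), we would expect the prediction at \(x = 0\) to land somewhere near zero, up to noise.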
Use RMSE or another regression metric!
Root Mean Squared Error (\(\text{RMSE}\), rmse)
\[ \text{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]
Let’s fit with a couple more values of \(k\) and see which is best.
RMSE for k = 1: 0.0
RMSE for k = 10: 0.23366221486551458
RMSE for k = 100: 0.764849529471475
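Output like the above could be produced along these lines. This is a sketch rather than the exact code behind the results: it assumes the models are sklearn KNeighborsRegressor objects fit to the full simulated data and evaluated on that same data. The dictionary name train_rmse is made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# simulated data, matching the setup at the top of these notes
np.random.seed(42)
n = 200
X = np.random.uniform(low=-2 * np.pi, high=2 * np.pi, size=(n, 1))
y = np.sin(X).ravel() + np.random.normal(loc=0, scale=0.25, size=n)


def rmse(y_true, y_pred):
    # root mean squared error between true and predicted values
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


# fit each model to all of the data, then predict on that same data
train_rmse = {}
for k in [1, 10, 100]:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X, y)
    train_rmse[k] = rmse(y, knn.predict(X))
    print(f"RMSE for k = {k}:", train_rmse[k])
```

With \(k = 1\), each sample is its own nearest neighbor, so the model simply returns the \(y\) value it has already seen and the RMSE is exactly zero. That is memorization, not evidence of a good model.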
A model generalizes well if it is able to make good predictions on new data. Predicting on already observed data is easy!
\[ \Huge Y = f(X) + \epsilon \]
When we overfit, which is one way to generalize poorly, we have accidentally learned the noise.
In practice, we cannot simply simulate more data!
However, the “easy” solution here is to take the available data and (randomly) use some of it for training, that is, the data used to fit a model. Then, (randomly) reserve some other data for testing, that is, the data we will calculate metrics on.
But, we actually need to go one step further…
An 80-20 split is common, but not required. The choice of how much data to put into each set is called data budgeting.
Now that we have a desire to fit a model to some data, but calculate metrics based on other data, we need to update our metric definitions.
Root Mean Squared Error (\(\text{RMSE}\), rmse)
\[ \text{RMSE}(f, \mathcal{D}) = \sqrt{ \frac{1}{n_\mathcal{D}} \sum_{i \in \mathcal{D}} \left( y_i - f(x_i) \right) ^ 2} \]
Importantly, now we need to consider both a dataset and function (learned from data) when calculating metrics. We’re still comparing “true” values to “predicted” values, but we need to pay attention to where they come from.
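Here, \(f\) is a function learned from data (in sklearn, a function like some_model.predict()), and \(\mathcal{D}\) is the dataset on which the metric is calculated, containing \(n_\mathcal{D}\) samples.
Validation metrics…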
Test metrics…
We’ll return to our simulated data from earlier.
We’ll also load some additional libraries we’ll need.
We first split the full data into a train and test set.
We then repeat the process to split the train set into a validation-train and validation set.
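The two splits might look something like the following sketch, which uses sklearn's train_test_split. The 80-20 proportions and the random_state values are assumptions, as are all variable names other than X_validation and y_validation, which appear later in these notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# simulated data, matching the setup at the top of these notes
np.random.seed(42)
n = 200
X = np.random.uniform(low=-2 * np.pi, high=2 * np.pi, size=(n, 1))
y = np.sin(X).ravel() + np.random.normal(loc=0, scale=0.25, size=n)

# first split: full data into train and test (assumed 80-20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# second split: train into validation-train and validation (assumed 80-20)
X_vtrain, X_validation, y_vtrain, y_validation = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42
)

# check how many samples landed in each set
print(X_vtrain.shape, X_validation.shape, X_test.shape)
```

Under these assumed proportions, the 200 samples are budgeted as 128 for validation-train, 32 for validation, and 40 for test.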
.fit()
KNeighborsRegressor(n_neighbors=1)
KNeighborsRegressor(n_neighbors=10)
KNeighborsRegressor(n_neighbors=100)
.predict()
array([-0.56366024, 0.31900317, -0.67916829, -0.92808265, -0.86472793,
0.17168979, -0.67916829, -0.08602215, -0.93826351, -0.55504094,
0.17168979, -0.46108356, -0.55645133, 0.44828237, -0.46108356,
-0.36936917, 0.22611211, -0.46108356, -0.90564424, -0.56366024,
0.94364577, 0.09707822, 0.91014566, 0.22611211, -0.90590527,
-0.36936917, -0.08602215, 0.22611211, -0.89805858, -0.90590527,
0.09707822, 0.25873394])
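For reference, the .fit() and .predict() steps could be sketched as follows. The model names knn001, knn010, and knn100 match the ones used when computing validation RMSE in these notes. The split proportions, random_state values, and the name val_predictions are assumptions, so the predicted values need not match the array shown above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# simulated data, matching the setup at the top of these notes
np.random.seed(42)
n = 200
X = np.random.uniform(low=-2 * np.pi, high=2 * np.pi, size=(n, 1))
y = np.sin(X).ravel() + np.random.normal(loc=0, scale=0.25, size=n)

# assumed 80-20 splits; random_state values are arbitrary choices
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
X_vtrain, X_validation, y_vtrain, y_validation = train_test_split(
    X_train, y_train, test_size=0.20, random_state=42
)

# create one model per candidate value of k
knn001 = KNeighborsRegressor(n_neighbors=1)
knn010 = KNeighborsRegressor(n_neighbors=10)
knn100 = KNeighborsRegressor(n_neighbors=100)

# fit each model using only the validation-train data
knn001.fit(X_vtrain, y_vtrain)
knn010.fit(X_vtrain, y_vtrain)
knn100.fit(X_vtrain, y_vtrain)

# predict on the validation data, which the models have not seen
val_predictions = knn010.predict(X_validation)
print(val_predictions)
```

Note that .fit() only ever sees the validation-train data, while .predict() is called on the validation data.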
def rmse(y_true, y_pred):
    # root mean squared error between true and predicted values
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_val_001 = rmse(y_validation, knn001.predict(X_validation))
rmse_val_010 = rmse(y_validation, knn010.predict(X_validation))
rmse_val_100 = rmse(y_validation, knn100.predict(X_validation))
print("Validation RMSE for k = 1:", rmse_val_001)
print("Validation RMSE for k = 10:", rmse_val_010)
print("Validation RMSE for k = 100:", rmse_val_100)
Validation RMSE for k = 1: 0.3279636107055588
Validation RMSE for k = 10: 0.24230224459351157
Validation RMSE for k = 100: 0.682208033130319
Based on these results, we would select a \(k\) value of 10, since it has the lowest validation RMSE. This will be our chosen model.
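Once a value of \(k\) is chosen, a common next step is to refit the chosen model to the full train set and report a test metric. The sketch below assumes that workflow. The split proportions, the random_state value, and the names knn_chosen and rmse_test are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# simulated data, matching the setup at the top of these notes
np.random.seed(42)
n = 200
X = np.random.uniform(low=-2 * np.pi, high=2 * np.pi, size=(n, 1))
y = np.sin(X).ravel() + np.random.normal(loc=0, scale=0.25, size=n)

# assumed 80-20 train-test split; random_state is an arbitrary choice
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)


def rmse(y_true, y_pred):
    # root mean squared error between true and predicted values
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


# refit the chosen model (k = 10) to the full train set
knn_chosen = KNeighborsRegressor(n_neighbors=10)
knn_chosen.fit(X_train, y_train)

# report the test metric, using data not involved in selecting k
rmse_test = rmse(y_test, knn_chosen.predict(X_test))
print("Test RMSE for k = 10:", rmse_test)
```

The test data plays no role in choosing \(k\), so the test RMSE gives a more honest picture of how the chosen model generalizes.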
The general procedure for supervised learning is:
In general,
Next week we’ll add one more layer of complexity to this procedure that we will use the rest of the semester: cross-validation.