Dummy Methods
The simplest methods that we can consider take input data, ignore any potential feature variables, and use only the target variable to learn how to make future predictions.
These methods are, not so politely, called dummy methods. While they may not perform well, they are important because they establish a baseline for performance.
Remember The Penguins?
|     | species   | bill_length_mm | bill_depth_mm | flipper_length_mm |
|----:|-----------|---------------:|--------------:|------------------:|
|  65 | Adelie    |           41.6 |          18.0 |             192.0 |
| 280 | Chinstrap |           52.7 |          19.8 |             197.0 |
| 187 | Gentoo    |           48.4 |          16.3 |             220.0 |
| 199 | Gentoo    |           50.5 |          15.9 |             225.0 |
| 296 | Chinstrap |           42.4 |          17.3 |             181.0 |
| 184 | Gentoo    |           45.1 |          14.5 |             207.0 |
|  98 | Adelie    |           33.1 |          16.1 |             178.0 |
Dummy Classifier
The dummy classifier will learn the most common category in the target variable. Then, it will always predict that category.
from palmerpenguins import load_penguins
import pandas as pd
penguins = load_penguins().dropna()
penguins['species'].value_counts(normalize=True)
species
Adelie 0.438438
Gentoo 0.357357
Chinstrap 0.204204
Name: proportion, dtype: float64
In sklearn, use DummyClassifier.
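As a minimal sketch, assuming the palmerpenguins setup above, a DummyClassifier with strategy="most_frequent" reproduces this behavior; the feature columns passed to fit are required by the API but ignored by the model.
from palmerpenguins import load_penguins
from sklearn.dummy import DummyClassifier
penguins = load_penguins().dropna()
# features are required by the fit/predict API but ignored by the model
X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]]
y = penguins["species"]
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
dummy.predict(X.head())  # always the modal class, Adelie
dummy.score(X, y)        # accuracy; matches the Adelie proportion above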
Dummy Regressor
The dummy regressor will learn the mean of the target variable. Then, it will always predict that value.
from palmerpenguins import load_penguins
import pandas as pd
penguins = load_penguins().dropna()
penguins['bill_depth_mm'].agg("mean")
np.float64(17.164864864864867)
In sklearn, use DummyRegressor.
We could also consider other statistics, such as the median, a particular quantile, or an arbitrary constant, as shown in the sketch below.
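As a minimal sketch, assuming the same data, DummyRegressor exposes each of these choices through its strategy parameter.
from palmerpenguins import load_penguins
from sklearn.dummy import DummyRegressor
penguins = load_penguins().dropna()
X = penguins[["bill_length_mm", "flipper_length_mm"]]  # ignored by the model
y = penguins["bill_depth_mm"]
dummy = DummyRegressor(strategy="mean")
dummy.fit(X, y)
dummy.predict(X.head())  # always the training mean, about 17.16
# other strategies: "median", "quantile" (with, say, quantile=0.9),
# or "constant" (with, say, constant=18.0)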
Evaluation Metrics
How well do our supervised learning methods learn? How do we quantify performance?
To evaluate, we’ll use a handful of metrics. In general, these will compare the true target variable to predictions of the target variable.
Classification Metrics
In the definitions of the metrics that follow:
- \(y\) is a vector of the true (actual) values of length \(n\)
- the \(y_i\) are particular elements of this vector
- \(\hat{y}\) is a vector of the predicted values of length \(n\)
- \(\hat{y}_i\) are particular elements of this vector
- \(n\) is the number of samples, which is the same for the actual and predicted values
Accuracy (\(\text{Accuracy}\), accuracy)
\[
\text{Accuracy}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} I(y_i = \hat{y}_i)
\]
Here, \(I\) is an indicator function:
\[
I(y_i = \hat{y}_i) =
\begin{cases}
1 & \text{if } y_i = \hat{y}_i \\
0 & \text{otherwise}
\end{cases}
\]
from palmerpenguins import load_penguins
import pandas as pd
penguins = load_penguins().dropna()
penguins['species'].value_counts(normalize=True)
species
Adelie 0.438438
Gentoo 0.357357
Chinstrap 0.204204
Name: proportion, dtype: float64
import numpy as np
np.mean(penguins["species"] == "Adelie")
np.float64(0.43843843843843844)
What an interesting result! The accuracy of always predicting the most common class, Adelie, is exactly the proportion of Adelie penguins in the data. This is the accuracy of the dummy classifier.
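As a quick check, sklearn's accuracy_score agrees; the always-Adelie list below stands in for the dummy classifier's predictions.
from palmerpenguins import load_penguins
from sklearn.metrics import accuracy_score
penguins = load_penguins().dropna()
y_true = penguins["species"]
y_pred = ["Adelie"] * len(y_true)  # the dummy classifier's predictions
accuracy_score(y_true, y_pred)     # same value as np.mean above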
Misclassification (\(\text{Misclassification}\), misclassification)
\[
\text{Misclassification}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)
\]
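Misclassification is one minus accuracy. A minimal sketch, reusing the always-Adelie predictions; sklearn's zero_one_loss computes this quantity directly.
import numpy as np
from palmerpenguins import load_penguins
from sklearn.metrics import zero_one_loss
penguins = load_penguins().dropna()
y_true = penguins["species"]
y_pred = ["Adelie"] * len(y_true)
np.mean(y_true != y_pred)      # 1 - accuracy
zero_one_loss(y_true, y_pred)  # same misclassification rate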
Regression Metrics
In the definitions of the metrics that follow:
- \(y\) is a vector of the true (actual) values of length \(n\)
- the \(y_i\) are particular elements of this vector
- \(\hat{y}\) is a vector of the predicted values of length \(n\)
- \(\hat{y}_i\) are particular elements of this vector
- \(n\) is the number of samples, which is the same for the actual and predicted values
- \(\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i\)
Root Mean Squared Error (\(\text{RMSE}\), rmse)
\[
\text{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\]
pred = penguins['bill_depth_mm'].agg("mean")
np.sqrt(np.mean((penguins['bill_depth_mm'] - pred) ** 2))
np.float64(1.9662764301482418)
We prefer RMSE to MSE. Why? Unlike MSE, RMSE is measured in the same units as the target variable, which makes it far easier to interpret.
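For reference, a sketch of the same computation through sklearn.metrics; taking np.sqrt of mean_squared_error avoids assuming a newer sklearn with a dedicated RMSE function.
import numpy as np
from palmerpenguins import load_penguins
from sklearn.metrics import mean_squared_error
penguins = load_penguins().dropna()
y_true = penguins["bill_depth_mm"]
y_pred = np.full(len(y_true), y_true.mean())  # dummy regressor predictions
np.sqrt(mean_squared_error(y_true, y_pred))   # matches the value above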
Mean Absolute Error (\(\text{MAE}\), mae)
\[
\text{MAE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\]
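A minimal sketch of MAE for the dummy regressor's mean predictions, computed once directly and once via sklearn's mean_absolute_error.
import numpy as np
from palmerpenguins import load_penguins
from sklearn.metrics import mean_absolute_error
penguins = load_penguins().dropna()
y_true = penguins["bill_depth_mm"]
y_pred = np.full(len(y_true), y_true.mean())
np.mean(np.abs(y_true - y_pred))     # direct computation
mean_absolute_error(y_true, y_pred)  # same value via sklearn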
Mean Absolute Percentage Error (\(\text{MAPE}\), mape)
\[
\text{MAPE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right|}{\left| y_i \right|}
\]
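Similarly for MAPE; note that sklearn's mean_absolute_percentage_error assumes scikit-learn 0.24 or later.
import numpy as np
from palmerpenguins import load_penguins
from sklearn.metrics import mean_absolute_percentage_error
penguins = load_penguins().dropna()
y_true = penguins["bill_depth_mm"]
y_pred = np.full(len(y_true), y_true.mean())
np.mean(np.abs(y_true - y_pred) / np.abs(y_true))  # direct computation
mean_absolute_percentage_error(y_true, y_pred)     # same value via sklearn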
Coefficient of Determination (\(R^2\), r2)
\[
R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]
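Note that predicting \(\bar{y}\) for every sample makes the numerator and denominator equal, so the dummy regressor has \(R^2 = 0\) exactly; a quick sketch with r2_score confirms this.
import numpy as np
from palmerpenguins import load_penguins
from sklearn.metrics import r2_score
penguins = load_penguins().dropna()
y_true = penguins["bill_depth_mm"]
y_pred = np.full(len(y_true), y_true.mean())
r2_score(y_true, y_pred)  # 0.0, up to floating-point error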
Max Error (\(\text{Max Error}\), max_error)
\[
\text{Max Error}(y, \hat{y}) = \max_{i} \left( \left| y_i - \hat{y}_i \right| \right)
\]
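A final sketch, computing the max error of the dummy regressor directly and via sklearn's max_error.
import numpy as np
from palmerpenguins import load_penguins
from sklearn.metrics import max_error
penguins = load_penguins().dropna()
y_true = penguins["bill_depth_mm"]
y_pred = np.full(len(y_true), y_true.mean())
np.max(np.abs(y_true - y_pred))  # direct computation
max_error(y_true, y_pred)        # same value via sklearn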