Dummy Methods
The simplest methods that we can consider take input data, ignore any potential feature variables, and use only the target variable to learn how to make future predictions.
These methods are, not so politely, called dummy methods. While they may not perform well, they are important because they establish a baseline for performance.
Remember The Penguins?
|     | species   | bill_length_mm | bill_depth_mm | flipper_length_mm |
|----:|-----------|---------------:|--------------:|------------------:|
|  65 | Adelie    |           41.6 |          18.0 |             192.0 |
| 280 | Chinstrap |           52.7 |          19.8 |             197.0 |
| 187 | Gentoo    |           48.4 |          16.3 |             220.0 |
| 199 | Gentoo    |           50.5 |          15.9 |             225.0 |
| 296 | Chinstrap |           42.4 |          17.3 |             181.0 |
| 184 | Gentoo    |           45.1 |          14.5 |             207.0 |
|  98 | Adelie    |           33.1 |          16.1 |             178.0 |
Dummy Classifier
The dummy classifier will learn the most common category in the target variable. Then, it will always predict that category.
from palmerpenguins import load_penguins
import pandas as pd
penguins = load_penguins().dropna()
penguins['species'].value_counts(normalize=True)
species
Adelie 0.438438
Gentoo 0.357357
Chinstrap 0.204204
Name: proportion, dtype: float64
In sklearn, use DummyClassifier.
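As a minimal sketch, assuming the palmerpenguins setup above, a DummyClassifier with strategy="most_frequent" reproduces this behavior; the feature columns passed to fit are required by the API but ignored by the model.
from palmerpenguins import load_penguins
from sklearn.dummy import DummyClassifier
penguins = load_penguins().dropna()
# features are required by the fit/predict API but ignored by the model
X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]]
y = penguins["species"]
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
dummy.predict(X.head())  # always the modal class, Adelie
dummy.score(X, y)        # accuracy; matches the Adelie proportion above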
Dummy Regressor
The dummy regressor will learn the mean of the target variable. Then, it will always predict that value.
from palmerpenguins import load_penguins
import pandas as pd
penguins = load_penguins().dropna()
penguins['bill_depth_mm'].agg("mean")
np.float64(17.164864864864867)
In sklearn, use DummyRegressor.
We could also consider other statistics, such as the median, a particular quantile, or an arbitrary constant, as shown in the sketch below.
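As a minimal sketch, assuming the same data, DummyRegressor exposes each of these choices through its strategy parameter.
from palmerpenguins import load_penguins
from sklearn.dummy import DummyRegressor
penguins = load_penguins().dropna()
X = penguins[["bill_length_mm", "flipper_length_mm"]]  # ignored by the model
y = penguins["bill_depth_mm"]
dummy = DummyRegressor(strategy="mean")
dummy.fit(X, y)
dummy.predict(X.head())  # always the training mean, about 17.16
# other strategies: "median", "quantile" (with, say, quantile=0.9),
# or "constant" (with, say, constant=18.0)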
Evaluation Metrics
How well do our supervised learning methods learn? How do we quantify performance?
To evaluate, we’ll use a handful of metrics. In general, these will compare the true target variable to predictions of the target variable.
Classification Metrics
In the definitions of the metrics that follow:
- \(y\) is a vector of the true (actual) values of length \(n\)
- the \(y_i\) are particular elements of this vector
- \(\hat{y}\) is a vector of the predicted values of length \(n\)
- \(\hat{y}_i\) are particular elements of this vector
- \(n\) is the number of samples, which is the same for the actual and predicted values
Accuracy (\(\text{Accuracy}\), accuracy)
\[
\text{Accuracy}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} I(y_i = \hat{y}_i)
\]
Here, \(I\) is an indicator function:
\[
I(y_i = \hat{y}_i) =
\begin{cases}
1 & \text{if } y_i = \hat{y}_i \\
0 & \text{otherwise}
\end{cases}
\]
from palmerpenguins import load_penguins
import pandas as pd
penguins = load_penguins().dropna()
penguins['species'].value_counts(normalize=True)
species
Adelie 0.438438
Gentoo 0.357357
Chinstrap 0.204204
Name: proportion, dtype: float64
import numpy as np
np.mean(penguins["species"] == "Adelie")
np.float64(0.43843843843843844)
What an interesting result! The accuracy of always predicting the most common class, Adelie, is exactly the proportion of Adelie penguins in the data. This is the accuracy of the dummy classifier.
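As a quick check, sklearn's accuracy_score agrees; the always-Adelie list below stands in for the dummy classifier's predictions.
from palmerpenguins import load_penguins
from sklearn.metrics import accuracy_score
penguins = load_penguins().dropna()
y_true = penguins["species"]
y_pred = ["Adelie"] * len(y_true)  # the dummy classifier's predictions
accuracy_score(y_true, y_pred)     # same value as np.mean above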
Misclassification (\(\text{Misclassification}\), misclassification)
\[
\text{Misclassification}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)
\]
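Misclassification is one minus accuracy. A minimal sketch, reusing the always-Adelie predictions; sklearn's zero_one_loss computes this quantity directly.
import numpy as np
from palmerpenguins import load_penguins
from sklearn.metrics import zero_one_loss
penguins = load_penguins().dropna()
y_true = penguins["species"]
y_pred = ["Adelie"] * len(y_true)
np.mean(y_true != y_pred)      # 1 - accuracy
zero_one_loss(y_true, y_pred)  # same misclassification rate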
Regression Metrics
In the definitions of the metrics that follow:
- \(y\) is a vector of the true (actual) values of length \(n\)
- the \(y_i\) are particular elements of this vector
- \(\hat{y}\) is a vector of the predicted values of length \(n\)
- \(\hat{y}_i\) are particular elements of this vector
- \(n\) is the number of samples, which is the same for the actual and predicted values
- \(\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i\)
Root Mean Squared Error (\(\text{RMSE}\), rmse)
\[
\text{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\]
pred = penguins['bill_depth_mm'].agg("mean")
np.sqrt(np.mean((penguins['bill_depth_mm'] - pred) ** 2))
np.float64(1.9662764301482418)
We prefer RMSE to MSE. Why? Unlike MSE, RMSE is measured in the same units as the target variable, which makes it far easier to interpret.
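For reference, a sketch of the same computation through sklearn.metrics; taking np.sqrt of mean_squared_error avoids assuming a newer sklearn with a dedicated RMSE function.
import numpy as np
from palmerpenguins import load_penguins
from sklearn.metrics import mean_squared_error
penguins = load_penguins().dropna()
y_true = penguins["bill_depth_mm"]
y_pred = np.full(len(y_true), y_true.mean())  # dummy regressor predictions
np.sqrt(mean_squared_error(y_true, y_pred))   # matches the value above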
Mean Absolute Error (\(\text{MAE}\), mae)
\[
\text{MAE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\]
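A minimal sketch of MAE for the dummy regressor's mean predictions, computed once directly and once via sklearn's mean_absolute_error.
import numpy as np
from palmerpenguins import load_penguins
from sklearn.metrics import mean_absolute_error
penguins = load_penguins().dropna()
y_true = penguins["bill_depth_mm"]
y_pred = np.full(len(y_true), y_true.mean())
np.mean(np.abs(y_true - y_pred))     # direct computation
mean_absolute_error(y_true, y_pred)  # same value via sklearn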
Mean Absolute Percentage Error (\(\text{MAPE}\), mape)
\[
\text{MAPE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right|}{\left| y_i \right|}
\]
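Similarly for MAPE; note that sklearn's mean_absolute_percentage_error assumes scikit-learn 0.24 or later.
import numpy as np
from palmerpenguins import load_penguins
from sklearn.metrics import mean_absolute_percentage_error
penguins = load_penguins().dropna()
y_true = penguins["bill_depth_mm"]
y_pred = np.full(len(y_true), y_true.mean())
np.mean(np.abs(y_true - y_pred) / np.abs(y_true))  # direct computation
mean_absolute_percentage_error(y_true, y_pred)     # same value via sklearn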
Coefficient of Determination (\(R^2\), r2)
\[
R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]
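Note that predicting \(\bar{y}\) for every sample makes the numerator and denominator equal, so the dummy regressor has \(R^2 = 0\) exactly; a quick sketch with r2_score confirms this.
import numpy as np
from palmerpenguins import load_penguins
from sklearn.metrics import r2_score
penguins = load_penguins().dropna()
y_true = penguins["bill_depth_mm"]
y_pred = np.full(len(y_true), y_true.mean())
r2_score(y_true, y_pred)  # 0.0, up to floating-point error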
Max Error (\(\text{Max Error}\), max_error)
\[
\text{Max Error}(y, \hat{y}) = \max_{i} \left( \left| y_i - \hat{y}_i \right| \right)
\]
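A final sketch, computing the max error of the dummy regressor directly and via sklearn's max_error.
import numpy as np
from palmerpenguins import load_penguins
from sklearn.metrics import max_error
penguins = load_penguins().dropna()
y_true = penguins["bill_depth_mm"]
y_pred = np.full(len(y_true), y_true.mean())
np.max(np.abs(y_true - y_pred))  # direct computation
max_error(y_true, y_pred)        # same value via sklearn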