Simulated Classification Data

To illustrate the basic usage of classification methods, we will often use the make_blobs function from sklearn. Most of the real data we use will be stored in pandas data frames, but when simulating data we will work directly with numpy arrays.

import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# simulate data
X, y = make_blobs(
    n_samples=100,
    n_features=2,
    cluster_std=3.75,
    centers=3,
    random_state=42,
)

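n_samples: the total number of samples, divided as evenly as possible among the clusters when given as an integer

n_features: the number of features, that is, the number of columns of X

centers: the number of clusters (classes) to generate when given as an integer, or an array of fixed center locations

cluster_std: the standard deviation of the clusters, either a single value shared by all clusters or one value per cluster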

Additional parameters and their usage are explained in the sklearn documentation.
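
As a minimal sketch of how centers and cluster_std interact, the example below passes explicit center coordinates and one standard deviation per cluster. The center locations, spreads, and sample size are made-up values for illustration, not the ones used above. Setting return_centers=True asks make_blobs to also return the centers it used.

```python
import numpy as np
from sklearn.datasets import make_blobs

# hypothetical center locations, one row per cluster (illustrative values)
demo_centers = np.array([[-5.0, 0.0], [0.0, 5.0], [5.0, 0.0]])

# one standard deviation per cluster: tight, moderate, and wide
X_demo, y_demo, used_centers = make_blobs(
    n_samples=300,
    centers=demo_centers,
    cluster_std=[0.5, 1.5, 3.0],
    random_state=42,
    return_centers=True,
)

# empirical spread of each cluster, averaged over the two features
spreads = []
for label in np.unique(y_demo):
    spread = X_demo[y_demo == label].std(axis=0).mean()
    spreads.append(spread)
    print(f"class {label}: center={used_centers[label]}, spread={spread:.2f}")
```

When centers is given as an array, the number of features is taken from its shape, so n_features does not need to be set. The empirical spreads should land close to the requested standard deviations, with some sampling noise.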

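# round the features to two decimal places for easier display
X = np.round(X, 2)
X.shape, X.dtype
((100, 2), dtype('float64'))
# print the first ten rows of the feature array
print(X[:10])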
[[-10.06 -12.56]
 [  7.69  -2.64]
 [ -4.27  11.05]
 [  9.9   -3.28]
 [ -5.54  -4.78]
 [  0.93  -0.15]
 [ -1.27  12.67]
 [  3.82   3.31]
 [  5.     5.61]
 [ -6.83  10.42]]
y.shape, y.dtype
((100,), dtype('int64'))
print(y)
[2 1 0 1 2 1 0 1 1 0 0 2 2 0 0 2 2 0 2 2 0 2 2 0 0 0 1 2 2 2 2 1 1 2 0 0 0
 0 1 1 2 0 1 0 0 1 2 2 2 1 1 1 0 2 2 2 0 0 1 0 2 1 2 1 2 2 1 2 1 1 1 2 2 0
 1 2 1 2 1 1 0 1 0 2 0 0 0 1 0 1 1 1 0 1 0 0 0 1 2 0]
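
Because y holds integer class labels, it can be used to count and select observations by class. The following is a small sketch of that idea. It re-creates the data with the same arguments as above under the hypothetical names X_sim and y_sim, so that it runs on its own and does not overwrite X and y.

```python
import numpy as np
from sklearn.datasets import make_blobs

# re-create the simulated data with the same arguments as above
X_sim, y_sim = make_blobs(
    n_samples=100,
    n_features=2,
    cluster_std=3.75,
    centers=3,
    random_state=42,
)

# count how many observations fall in each class
labels, counts = np.unique(y_sim, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))

# select the rows of X_sim belonging to class 0 with a boolean mask
X_class0 = X_sim[y_sim == 0]
print(X_class0.shape)

# per-class feature means, a rough estimate of each cluster's center
class_means = np.array([X_sim[y_sim == k].mean(axis=0) for k in labels])
print(np.round(class_means, 2))
```

The counts should be nearly equal across classes, since make_blobs divides an integer n_samples as evenly as it can among the centers.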

Visualization

# initialize plot
fig, ax = plt.subplots(figsize=(8, 8))

# add points as a scatter plot
scatter = ax.scatter(
    X[:, 0],
    X[:, 1],
    c=y,
    s=75,
    edgecolor="black",
    linewidths=0.5,
    cmap=ListedColormap(["dodgerblue", "darkorange", "darkgrey"]),
)

# add legend
legend = ax.legend(
    *scatter.legend_elements(markeredgecolor="black", markeredgewidth=0.5),
    loc="upper right",
    title="Classes"
)
ax.add_artist(legend)

# add grid
ax.grid(True, linestyle="--", color="lightgrey")
ax.set_axisbelow(True)  # put grid behind the points

# show plot
plt.show()